Kotlin Regex find undesired behavior with startindex

Time:09-06

I was trying to find words using a regex in Kotlin. Here is a snippet of sample code:

val possibleString = "#This is a comment"
val regex = "(?<=[ \t\n$PUNC])(\\w+)".toRegex() // PUNC is another char sequence of punctuation
val matcher1 = regex.find(possibleString)
val matcher2 = regex.find(possibleString, 1)
println(matcher1?.value) // This
println(matcher2?.value) // This

The value of matcher1 makes sense to me: it yields This. However, why does matcher2 also return This? If the start index is 1, don't we start from 'T' and output "is" instead?

I'm wondering why this is the case. Does the matcher still scan the string before the start index?

If so, I know I could pass a substring starting from index 1 to get the desired output. However, for large chunks of text, generating multiple substrings seems like a waste of memory. Is there an efficient workaround?

Thanks!

CodePudding user response:

If you begin your search at "T", you will find This, because the startIndex is inclusive. A match will be found unless it starts before the startIndex. If it starts on the startIndex, it will still be found.

I suspect you expected find to ignore the first #, because it comes before the startIndex. This is not true. startIndex only says where the search starts; lookbehinds can still inspect the characters before it, so they don't suddenly break because you started at a later index.
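A minimal sketch of this behaviour. The question does not show how PUNC is defined, so the value below is a hypothetical stand-in that is assumed to contain '#'; firstWordFrom is likewise just an illustrative helper name.

```kotlin
// Hypothetical stand-in for the question's PUNC; assumed to include '#'.
const val PUNC = "#.,;:!?"

val lookbehindRegex = Regex("(?<=[ \t\n$PUNC])(\\w+)")

// Illustrative helper: first match whose start is at or after `start`.
fun firstWordFrom(text: String, start: Int): String? =
    lookbehindRegex.find(text, start)?.value

fun main() {
    val text = "#This is a comment"
    // Index 1 is 'T'. The lookbehind still sees '#' at index 0,
    // so a match beginning exactly at the start index succeeds.
    println(firstWordFrom(text, 1)) // This
    // From index 2 onward, "This" can no longer begin a match.
    println(firstWordFrom(text, 2)) // is
}
```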

Your desired behaviour isn't how lookbehinds work, so the workaround would be to use a group instead.

// Consume the delimiter as part of the match; the word itself is group 1.
val regex = "[ \t\n$PUNC](\\w+)".toRegex()
val matcher1 = regex.find(possibleString)
val matcher2 = regex.find(possibleString, 1)
println(matcher1?.groups?.get(1)?.value) // This
println(matcher2?.groups?.get(1)?.value) // is

You should think of startIndex as the minimum index at which a match may begin. With the original lookbehind regex, if you want the match is, for example, you can start searching from any index between 2 and 6 (both inclusive), since is starts at index 6:

// regex here is the original lookbehind version from the question
println(regex.find(possibleString, 2)?.value) // is
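If you want the start index to act as a hard boundary for the lookbehind without copying a substring, one option on Kotlin/JVM is to drop down to java.util.regex.Matcher and restrict its region. A matcher uses opaque bounds by default, so lookbehind cannot see characters before the region start. This is only a sketch: PUNC is again a hypothetical value assumed to contain '#', and findFrom is an illustrative helper, not a library function.

```kotlin
import java.util.regex.Pattern

// Hypothetical stand-in for the question's PUNC; assumed to include '#'.
const val PUNC = "#.,;:!?"

val wordPattern: Pattern = Pattern.compile("(?<=[ \t\n$PUNC])(\\w+)")

// Illustrative helper: searches from `start` while hiding everything
// before `start` from the lookbehind. No substring is created.
fun findFrom(text: CharSequence, start: Int): String? {
    val matcher = wordPattern.matcher(text)
    matcher.region(start, text.length) // bounds are opaque by default
    return if (matcher.find()) matcher.group() else null
}

fun main() {
    // 'T' at index 1 is rejected because '#' lies outside the region.
    println(findFrom("#This is a comment", 1)) // is
}
```

Note that a region also changes how ^ and $ behave (anchoring bounds are on by default), which does not matter for this pattern but may for others.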