I was trying to find words using regex in Kotlin. Here is a snippet of sample code:
val possibleString = "#This is a comment"
val regex = "(?<=[ \t\n$PUNC])(\\w+)".toRegex() // PUNC is another char sequence of punctuation
val matcher1 = regex.find(possibleString)
val matcher2 = regex.find(possibleString, 1)
println(matcher1?.value) // This
println(matcher2?.value) // This
The value of matcher1 makes sense to me: it yields This. However, why does matcher2 also return This? If the start index is 1, shouldn't the search start from 'T' and output "is" instead?
I'm wondering why this is the case. Does the matcher still scan the part of the string before the start index?
If so, I know I could pass a substring starting from index 1 to get the desired output. However, with large chunks of text, generating multiple substrings seems like a waste of memory. So, is there an efficient workaround?
Thanks!
CodePudding user response:
If you begin your search at "T", you will find This, because the startIndex is inclusive. A match will be found unless it starts before the startIndex. If it starts exactly on the startIndex, it will still be found.
I suspect that your misunderstanding might be thinking that find would ignore the first #, because it is before the startIndex. This is not true. startIndex only says where the search starts; the lookbehind can still see the characters before it, so it doesn't suddenly break because you started at a later index.
Your desired behaviour isn't how lookbehinds work, so the workaround would be to use a group instead.
val regex = "[ \t\n$PUNC](\\w+)".toRegex()
val matcher1 = regex.find(possibleString)
val matcher2 = regex.find(possibleString,1)
println(matcher1?.groups?.get(1)?.value) // This
println(matcher2?.groups?.get(1)?.value) // is
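To try the two variants side by side, here is a minimal self-contained sketch. The PUNC value below is only an assumption for illustration (the real one isn't shown in the question), and the helper names firstWordLookbehind and firstWordGroup are made up for this example.

```kotlin
// Hypothetical punctuation set; the real PUNC is not shown in the question.
const val PUNC = "#.,;:!?"

// Lookbehind version from the question: the separator is asserted, not consumed.
val lookbehindRegex = "(?<=[ \t\n$PUNC])(\\w+)".toRegex()

// Group version: the separator is consumed, so the whole match
// (separator included) has to begin at or after startIndex.
val groupRegex = "[ \t\n$PUNC](\\w+)".toRegex()

fun firstWordLookbehind(text: String, startIndex: Int): String? =
    lookbehindRegex.find(text, startIndex)?.value

fun firstWordGroup(text: String, startIndex: Int): String? =
    groupRegex.find(text, startIndex)?.groupValues?.get(1)

fun main() {
    val text = "#This is a comment"
    println(firstWordLookbehind(text, 1)) // This
    println(firstWordGroup(text, 1))      // is
}
```

With the group version, the # at index 0 would have to be part of the match, so a search starting at index 1 can no longer use it and moves on to the space before "is".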
You should think of the startIndex as the minimum index of the beginning of a match. If you want the match is with the lookbehind regex from the question, for example, you can start searching from anywhere between index 2 (inclusive) and 6 (inclusive), since is starts on index 6:
println(regex.find(possibleString, 2)?.value) // is
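If you are on the JVM and want to keep the lookbehind without creating substrings, one option that might fit is to drop down to java.util.regex.Matcher and restrict its region. This is only a sketch, not something from the code above: the PUNCTUATION value and the findWordInRegion helper are assumptions for illustration. It relies on the documented default that a matcher's region bounds are opaque, meaning lookbehinds cannot see past them.

```kotlin
// Hypothetical punctuation set; the real PUNC is not shown in the question.
const val PUNCTUATION = "#.,;:!?"

// Same lookbehind regex as in the question, converted to a java.util.regex.Pattern (JVM only).
val wordPattern = "(?<=[ \t\n$PUNCTUATION])(\\w+)".toRegex().toPattern()

// Returns the first word found when everything before startIndex is hidden
// from the matcher, without copying the text into a substring.
fun findWordInRegion(text: String, startIndex: Int): String? {
    val matcher = wordPattern.matcher(text)
    // region() limits matching to [startIndex, text.length). Region bounds are
    // opaque by default, so the lookbehind cannot see text[startIndex - 1].
    matcher.region(startIndex, text.length)
    return if (matcher.find()) matcher.group(1) else null
}

fun main() {
    val text = "#This is a comment"
    println(findWordInRegion(text, 0)) // This
    println(findWordInRegion(text, 1)) // is
}
```

The matcher works on the original text, so no extra string is allocated per search; only the Matcher object itself is created.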