Convert regexp.FindStringIndex results to character indices


The regexp.FindStringIndex(s string) []int method returns a slice of byte indices marking the start and end of a match. In simple scenarios these indices correspond to the "character position" in the string. However, certain characters break that assumption. For example:

package main

import (
    "fmt"
    "regexp"
)

var (
    re   = regexp.MustCompile(`bbb`)
    str1 = "aaa bbb ccc"
    str2 = "aaa✌️bbb ccc"
)

func main() {
    fmt.Println(str1, re.FindStringIndex(str1))
    fmt.Println(str2, re.FindStringIndex(str2))
}

Result:

aaa bbb ccc [4 7]
aaa✌️bbb ccc [9 12]

Why is this and how could one convert the FindStringIndex result to locate characters within a string rather than bytes?

CodePudding user response:

Without repeating lengthy articles on the subject of character encodings, the short version is that some characters have more complex data representations and require more bytes than others.

Although Go has the concept of a rune, which is a single Unicode code point, that is not necessarily equivalent to a "character". The correct term for a "user-perceived character" is actually a grapheme, or grapheme cluster.

Now that we have our terminology straight, the task is to map the byte indices from FindStringIndex to grapheme cluster indices. I don't know of a way to do this with the Go standard library, but there is a package called uniseg that lets us identify the grapheme clusters within a string. From the readme:

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points using the for loop or by casting: []rune(str). However, multiple code points may be combined into one user-perceived character or what the Unicode specification calls "grapheme cluster".

This package provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.

The readme also contains excellent examples of strings vs. code points vs. graphemes that help demystify the subject.

So how do we use this package to solve our problem?

package main

import (
    "fmt"
    "regexp"

    "github.com/rivo/uniseg"
)

var (
    re   = regexp.MustCompile(`bbb`)
    str1 = "aaa bbb ccc"
    str2 = "aaa✌️bbb ccc"
)

func main() {
    fmt.Println(str1, re.FindStringIndex(str1), mapCoords(str1, re.FindStringIndex(str1)))
    fmt.Println(str2, re.FindStringIndex(str2), mapCoords(str2, re.FindStringIndex(str2)))
}

func mapCoords(s string, byteCoords []int) (graphemeCoords []int) {
    graphemeCoords = make([]int, 2)
    gr := uniseg.NewGraphemes(s)
    graphemeIndex := -1
    for gr.Next() {
        graphemeIndex++
        a, b := gr.Positions()
        if a == byteCoords[0] {
            graphemeCoords[0] = graphemeIndex
        }
        if b == byteCoords[1] {
            graphemeCoords[1] = graphemeIndex + 1
            break
        }
    }
    return
}

Result:

aaa bbb ccc [4 7] [4 7]
aaa✌️bbb ccc [9 12] [4 7]


CodePudding user response:

This is because strings in Go are (by default/convention) encoded in UTF-8, and the character you wrote occupies more than one byte in UTF-8 encoding.

This follows the normal convention for Go, where offsets into strings act the same as they do for byte slices (i.e. they are byte offsets, not character offsets). This is not specific to the regexp package, it's how strings work in Go in general.

If you really wish to determine the offset in characters, you can use one of the functions from the utf8 package to count runes, or let the range operator do it for you: ranging over a string yields the byte offset of each successive rune. This snippet determines the rune offset in a string given a byte offset:

func runeOffset(str string, byteOffset int) int {
    cc := 0
    for i := range str {
        if i >= byteOffset {
            return cc
        }
        cc++
    }
    return cc
}

However, it is important to understand that you normally don't need to count characters. The general idea is that strings in Go are treated as opaque for as long as possible, and UTF-8 decoding is done "lazily", only for the specific string operations that require it. The odds are that whatever code you wrote after this which requires a character offset can be refactored, to good or better effect, to use a byte offset instead.
