I am trying to implement very simple text matcher for Czech words. Since Czech language is very suffix heavy I want to define start of the word and then just greedy match rest of the word. This is my implementation so far:
r := regexp.MustCompile("(?i)\\by\\w \\b")
text := "x yž z"
matches := r.FindAllString(text, -1)
fmt.Println(matches) //have [], want [yž]
I studied Go's regexp syntax: https://github.com/google/re2/wiki/Syntax
but I don't know, how to define czech language characters there? Using \w
just matches ASCII characters, not Czech UTF characters.
Can you please help me?
CodePudding user response:
In RE2, both \w
and \b
are not Unicode-aware:
\b
at ASCII word boundary («\w
» on one side and «\W
», «\A
», or «\z
» on the other)
\w
word characters (==[0-9A-Za-z_]
)
A more generalized example will be to split with any chunk of one or more non-letter chars, and then collect only those items that meet your criteria:
package main
import (
"fmt"
"strings"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`\P{L} `)
str := "x-- yž,,,.z..00"
words := r.Split(str, -1)
for i := range words {
if len(words[i]) > 0 && (strings.HasPrefix(words[i], `y`) || (strings.HasPrefix(words[i], `Y`)) {
output = append(output, words[i])
}
}
fmt.Println(output)
}
See the Go demo.
Note that a naive approach like
package main
import (
"fmt"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
str := "x-- yž,,,.z..00..."
matches := r.FindAllStringSubmatch(str, -1)
for _, v := range matches {
output = append(output, v[1])
}
fmt.Println(output)
}
won't work in case you have match1,match2 match3
like consecutive matches in the string as it will only getch the odd occurrences since the last non-capturing group pattern will consume the char that is supposed to be matched by the first non-capturing group pattern upon the next match.
A workaround for the above code would be adding some non-letter char to the end of the non-letter streaks, say
package main
import (
"fmt"
"regexp"
)
func main() {
output := []string{}
r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L} `).ReplaceAllString(str, `$0 `), -1)
for _, v := range matches {
output = append(output, v[1])
}
fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]
See this Go demo.
Here, regexp.MustCompile(`\P{L} `).ReplaceAllString(str, `$0 `)
adds a space after all chunks of non-letter chars.
CodePudding user response:
Will a hardcoded regex like this help:
regexp.MustCompile("[A-Za-z]*[ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž] [ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽžA-Za-z]*[ ] ")
full example here