Czech characters in regexp search

I am trying to implement a very simple text matcher for Czech words. Since Czech is very suffix-heavy, I want to define the start of a word and then greedily match the rest of it. This is my implementation so far:

    r := regexp.MustCompile("(?i)\\by\\w+\\b")
    text := "x yž z"
    matches := r.FindAllString(text, -1)
    fmt.Println(matches) // have [], want [yž]

I studied Go's regexp syntax: https://github.com/google/re2/wiki/Syntax

but I don't know how to specify Czech characters there. \w matches only ASCII characters, not Czech Unicode characters such as ž.

Can you please help me?

CodePudding user response：

In RE2, neither \w nor \b is Unicode-aware:

\b at ASCII word boundary («\w» on one side and «\W», «\A», or «\z» on the other)
\w word characters (== [0-9A-Za-z_])

A more general approach is to split the text on any run of one or more non-letter characters and then keep only the items that meet your criteria:

package main

import (
    "fmt"
    "strings"
    "regexp"
)

func main() {
    output := []string{}
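    // \P{L}+ matches a run of one or more non-letter characters.
    r := regexp.MustCompile(`\P{L}+`)
    str := "x--  yž,,,.z..00"
    words := r.Split(str, -1)
    for i := range words {
        // Split can yield empty strings at the edges, so skip those.
        if len(words[i]) > 0 && (strings.HasPrefix(words[i], `y`) || strings.HasPrefix(words[i], `Y`)) {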
            output = append(output, words[i])
        }
    }
    fmt.Println(output)
}


Note that a naive approach like

package main

import (
    "fmt"
    "regexp"
)

func main() {
    output := []string{}
    r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
    str := "x--  yž,,,.z..00..."
    matches := r.FindAllStringSubmatch(str, -1)
    for _, v := range matches {
        output = append(output, v[1])
    }
    fmt.Println(output)
}

won't work when the string contains consecutive matches separated by a single non-letter character, as in match1,match2,match3. It will fetch only the odd occurrences, because the trailing non-capturing group consumes the very character that the leading non-capturing group needs in order to start the next match.

A workaround for the above code is to append an extra non-letter character (a space, say) to every run of non-letter characters before matching:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    output := []string{}
    r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
    str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
    matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
    for _, v := range matches {
        output = append(output, v[1])
    }
    fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]


Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) appends a space to every run of non-letter characters, so each word gets a delimiter of its own on both sides.
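
To see what that padding step does on its own, here is a minimal sketch. It assumes the pattern is \P{L}+ (one or more non-letters), and the helper name padNonLetters is made up for the illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// padNonLetters (hypothetical helper) appends one space to every run of
// non-letter characters; $0 stands for the whole match.
func padNonLetters(s string) string {
	return regexp.MustCompile(`\P{L}+`).ReplaceAllString(s, "$0 ")
}

func main() {
	fmt.Printf("%q\n", padNonLetters("uhličitá,uhličité,,yz"))
	// "uhličitá, uhličité,, yz"
}
```

After padding, the comma can close one match while the added space opens the next, so consecutive words are no longer skipped.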

CodePudding user response：

Would a hardcoded regex like this help?

regexp.MustCompile("[A-Za-z]*[ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž]+[ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽžA-Za-z]*[ ]+")
