Czech characters in regexp search

Time:10-12

I am trying to implement a very simple text matcher for Czech words. Since the Czech language is very suffix-heavy, I want to define the start of a word and then greedily match the rest of it. This is my implementation so far:

    r := regexp.MustCompile("(?i)\\by\\w+\\b")
    text := "x yž z"
    matches := r.FindAllString(text, -1)
    fmt.Println(matches) //have [], want [yž]

I studied Go's regexp syntax: https://github.com/google/re2/wiki/Syntax

but I don't know how to define Czech characters there. \w matches only ASCII characters, not the accented Czech letters encoded in UTF-8.

Can you please help me?

CodePudding user response:

In RE2, neither \w nor \b is Unicode-aware:

\b at ASCII word boundary («\w» on one side and «\W», «\A», or «\z» on the other)
\w word characters (== [0-9A-Za-z_])

A more general approach is to split the string on any chunk of one or more non-letter chars, and then collect only those items that meet your criteria:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    output := []string{}
    // Split on runs of one or more non-letter characters.
    r := regexp.MustCompile(`\P{L}+`)
    str := "x--  yž,,,.z..00"
    words := r.Split(str, -1)
    for _, w := range words {
        // Keep non-empty items that start with "y" or "Y".
        if len(w) > 0 && (strings.HasPrefix(w, "y") || strings.HasPrefix(w, "Y")) {
            output = append(output, w)
        }
    }
    fmt.Println(output) // [yž]
}

See the Go demo.

Note that a naive approach like

package main

import (
    "fmt"
    "regexp"
)

func main() {
    output := []string{}
    r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
    str := "x--  yž,,,.z..00..."
    matches := r.FindAllStringSubmatch(str, -1)
    for _, v := range matches {
        output = append(output, v[1])
    }
    fmt.Println(output)
}

won't work when matches are consecutive, as in match1,match2 match3: it will only fetch the odd occurrences, because the last non-capturing group consumes the char that the first non-capturing group would need to match at the start of the next match.

A workaround for the above code is to append an extra non-letter char to the end of each streak of non-letter chars, say

package main

import (
    "fmt"
    "regexp"
)

func main() {
    output := []string{}
    r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
    str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
    matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
    for _, v := range matches {
        output = append(output, v[1])
    }
    fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]

See this Go demo.

Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) adds a space after every chunk of non-letter chars.
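
To make the padding step visible on its own, here is a minimal sketch that applies only the replacement, assuming the non-letter pattern is \P{L}+. The names nonLetters and pad are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonLetters matches each run of one or more non-letter characters.
var nonLetters = regexp.MustCompile(`\P{L}+`)

// pad appends a single space to every run of non-letter characters,
// so each word keeps a non-letter character in front of it.
func pad(s string) string {
	return nonLetters.ReplaceAllString(s, "$0 ")
}

func main() {
	fmt.Printf("%q\n", pad("yz,my"))  // "yz, my"
	fmt.Printf("%q\n", pad("a,b;c")) // "a, b; c"
}
```

After padding, the comma can be consumed by the end of one match while the added space is still available to start the next one.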

CodePudding user response:

Would a hard-coded regex like this help?

regexp.MustCompile("[A-Za-z]*[ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž]+[ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽžA-Za-z]*[ ]+")

full example here
