I need to delete from a string all symbols except accented letters in Go. My code instead deletes all symbols, including accented letters:
str := "cafè!?"
reg, err := regexp.Compile(`[^\w]`)
str = reg.ReplaceAllString(str, " ")
I expect the following output:
cafè
But the output with my code is:
caf
I want to include è, é, à, ò, ì (and of course all letters from a to z and numbers from 0 to 9)
How can I do this? Thanks for your help
CodePudding user response:
To include è, é, à, ò, ì, just add them to the regex: [^\wèéàòìÈÉÀÒÌ]
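For example, a minimal sketch applying that class to the string from the question (replacing with "" rather than " " here, just to keep the output clean):

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// \w covers a-z, A-Z, 0-9 and _; the accented letters are listed explicitly.
	reg := regexp.MustCompile(`[^\wèéàòìÈÉÀÒÌ]`)
	fmt.Println(reg.ReplaceAllString("cafè!?", "")) // prints: cafè
}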
You might also use [^\d\p{Latin}], but that will match more characters. \d matches digits, and \p{Latin} is a Unicode class covering all Latin characters, including those with diacritics.
For example:
re := regexp.MustCompile(`[^\d\p{Latin}]`)
fmt.Println(re.ReplaceAllString(`Test123éËà-ŞŨğБла通用`, ""))
Will print:
Test123éËàŞŨğ
CodePudding user response:
You can use a Unicode text segmentation library to iterate over grapheme clusters, and check that the first rune in each grapheme cluster has the right category (letter or digit).
import (
	"strings"
	"unicode"

	"github.com/rivo/uniseg"
)

func stripSpecial(s string) string {
	var b strings.Builder
	gr := uniseg.NewGraphemes(s)
	for gr.Next() {
		// The first rune of each grapheme cluster determines its category;
		// any following runes are combining marks or other modifiers.
		r := gr.Runes()[0]
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteString(gr.Str())
		}
	}
	return b.String()
}
The code works by first breaking the string into grapheme clusters,
"cafè!?" -> ["c", "a", "f", "è", "!", "?"]
Each grapheme cluster may contain multiple Unicode code points. The first code point determines the type of character, and the remaining code points (if any) are accent marks or other modifiers. So we filter and concatenate:
["c", "a", "f", "è"] -> "cafè"
This will pass through any accented or unaccented letters and digits, regardless of how they are normalized and no matter what accents they carry (including z̶̰̬̰͈̅̒̚͝å̷̢̡̦̼̥̘̙̺̩̮̱̟̳̙͂́̇̓̉́͒̎͜ḽ̷̢̣̹̳̊̋ͅg̵̙̞͈̥̳̗͙͚͛̀͘o̴̧̟̞̞̠̯͈͔̽̎͋̅́̈̅̊̒ text). It will, however, strip characters such as zero-width joiners, which can mangle words in scripts that depend on them, like Devanagari. So if you care about an international audience, review whether your audience's scripts use zero-width joiners.