What is the regex for German strings with umlauts?

Does anyone know which regex matches German words that contain umlauts?

const string = 'Aktivitäten und Ausflugsziele in der Nähe von keyword.'

function getTags ( string ) {
   let tags = []
   string = string.toLocaleLowerCase()
   tags = string.match(/\b(\w+)\b/g)
   return tags
}

The regex /\b(\w+)\b/g works perfectly for plain ASCII words. However, words with umlauts are split like this:

[ 'aktivit', 'ten', 'und', 'ausflugsziele', 'in', 'der', 'he', 'von', 'keyword' ]

I then tried the regex /\b(\w+[0-9a-zäöüÄÖÜ])\b/g, which gets closer to the expected result, but it still does not match through to the end of the word.

[ 'aktivitä', 'ten', 'und', 'ausflugsziele', 'in', 'der', 'nä', 'he', 'von', 'keyword' ]

Does anyone know the correct regex to handle German umlauts? Expected output:

[ 'aktivitäten', 'und', 'ausflugsziele', 'in', 'der', 'nähe', 'von', 'keyword' ]

CodePudding user response：

As a first iteration, you could try /\b([0-9a-zA-ZäöüÄÖÜß]+)\b/g. Note that I added A-Z and ß to your character set and applied the + (one or more repetitions) quantifier to it. This will still fail for edge cases, because word boundaries are defined in terms of \w, which doesn't include umlauts: Äpfel and bä won't match in full because they start or end with an umlaut. Additionally, what about other languages, such as the French "è"? I propose the following simple regex instead:

\p{L}+ - one or more Unicode letters. You might want to include a Unicode number property (for example \p{N}) as well. Note also that you need the unicode flag (u) here.

You must get rid of the word boundaries. This is not a problem, however, because greedy matching ensures that words are never cut in the middle.

If you want to limit yourself to the German alphabet, you can use [a-zäöüß]+ instead (you have already lowercased the string).
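To make the edge cases above concrete, here is a small sketch that compares the boundary-based pattern with the Unicode property pattern. The helper name `getTagsUnicode` is made up for illustration, and the `u` flag assumes an engine that supports Unicode property escapes (ES2018 or later).

```javascript
// Boundary-based pattern: \b is defined in terms of \w, so an umlaut at the
// start or end of a word is cut off.
const bounded = /\b([0-9a-zA-ZäöüÄÖÜß]+)\b/g

console.log('Äpfel'.match(bounded)) // [ 'pfel' ]
console.log('bä'.match(bounded))    // [ 'b' ]

// Hypothetical helper: \p{L}+ with the u flag and no word boundaries.
function getTagsUnicode(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.toLocaleLowerCase().match(/\p{L}+/gu) || []
}

console.log(getTagsUnicode('Aktivitäten und Ausflugsziele in der Nähe von keyword.'))
// [ 'aktivitäten', 'und', 'ausflugsziele', 'in', 'der', 'nähe', 'von', 'keyword' ]
```

Digits are dropped by this pattern; if they should count as part of a tag, the character class has to be widened accordingly.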

CodePudding user response：

Change /\b(\w+[0-9a-zäöüÄÖÜ])\b/g to /[0-9a-zäöüÄÖÜ]+/g. The + means "find one or more characters from [0-9a-zäöüÄÖÜ]". A space isn't in that set, so the match stops at the first space and the search continues with the next word.

const string = 'Aktivitäten und Ausflugsziele in der Nähe von keyword.'

function getTags(string) {
   let tags = []
   string = string.toLocaleLowerCase()
   tags = string.match(/[0-9a-zäöüÄÖÜ]+/g)
   return tags
}

console.log(getTags(string))
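One caveat with a hand-written character class is that it only covers the letters it lists. The class above has no ß, so a word such as Straße would be split. A small sketch of the difference (the sample words and the variable names are made up for illustration):

```javascript
// The class from the answer above, without ß.
const withoutEszett = /[0-9a-zäöüÄÖÜ]+/g
// A variant that adds ß; uppercase umlauts are not needed after lowercasing.
const withEszett = /[0-9a-zäöüß]+/g

const sample = 'Straße und Fußgänger'.toLocaleLowerCase()

console.log(sample.match(withoutEszett)) // [ 'stra', 'e', 'und', 'fu', 'gänger' ]
console.log(sample.match(withEszett))    // [ 'straße', 'und', 'fußgänger' ]
```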

CodePudding user response：

\w only matches identifier characters, that is [a-zA-Z0-9_].

That being said, you can separate words by whitespace, using [^\s] as a more general version of \w (anything that is not whitespace), and (?<=\s|^) and (?=\s|$) as replacements for \b. These mean, respectively, that there is whitespace or the start of the string before the word, and whitespace or the end of the string after it.

So, all together,

const string = 'Ich denke daß es gut ist. Aktivitäten und Ausflugsziele in der Nähe von keyword. Erdoğan had a ğ in his name.' 

function getTags ( string ) {
   let tags = []
   string = string.toLocaleLowerCase()
   tags = string.match(/(?<=\s|^)([^ ]+)(?=\s|$)/g)
   return tags
}

console.log(getTags(string));

Note that this works even with letters other than the ones we might think of at first (unlike solutions based on a fixed set such as [äöüß...]), and also with words that start with one of those non-ASCII letters, unlike solutions that still use \b. Replacing \w is not enough if the first or last letter of the word is a non-\w letter.
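Because this approach treats everything between whitespace as a word, trailing punctuation stays attached to the token. Here is a minimal sketch of what that looks like, along with one possible cleanup step. The helper name `stripPunctuation` and the use of `\p{P}` are assumptions added for illustration, not part of the answer above; both lookbehind and `\p{...}` need an ES2018-capable engine.

```javascript
const whitespaceDelimited = /(?<=\s|^)([^ ]+)(?=\s|$)/g

const tokens = 'in der nähe von keyword. erdoğan'.match(whitespaceDelimited)
console.log(tokens)
// [ 'in', 'der', 'nähe', 'von', 'keyword.', 'erdoğan' ]

// Hypothetical cleanup: remove Unicode punctuation from both ends of a token.
function stripPunctuation(token) {
  return token.replace(/^\p{P}+|\p{P}+$/gu, '')
}

console.log(tokens.map(stripPunctuation))
// [ 'in', 'der', 'nähe', 'von', 'keyword', 'erdoğan' ]
```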

CodePudding user response：

I suggest you switch to a Unicode-aware regex, which is widely supported by browsers today. That way all Unicode letters are supported, not just umlauts.

Use this regex:

/(?<=^|[^\p{L}\p{N}])[\p{L}\p{N}]+(?=[^\p{L}\p{N}]|$)/gu

Note the unicode flag (u). Neither \w nor \b handles non-ASCII characters, so we use Unicode-aware lookarounds instead.

Explanation:

(?<=^|[^\p{L}\p{N}]) - look behind for start of string OR any character not being in unicode category {Letter} or {Number}

[\p{L}\p{N}]+ - match one or more characters belonging to unicode category {Letter} OR {Number}

(?=[^\p{L}\p{N}]|$) - look ahead for any character not being in unicode category {Letter} or {Number} OR end of string
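Putting the pieces together, the pattern can be dropped into the original function roughly as follows. This is a sketch rather than a definitive implementation: the names `wordPattern` and `getTagsWithLookarounds` and the second sample sentence are made up for illustration, and an ES2018-capable engine is assumed for lookbehind and `\p{...}`.

```javascript
// Letters and numbers, delimited by anything that is neither (or by the string edges).
const wordPattern = /(?<=^|[^\p{L}\p{N}])[\p{L}\p{N}]+(?=[^\p{L}\p{N}]|$)/gu

function getTagsWithLookarounds(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.toLocaleLowerCase().match(wordPattern) || []
}

console.log(getTagsWithLookarounds('Aktivitäten und Ausflugsziele in der Nähe von keyword.'))
// [ 'aktivitäten', 'und', 'ausflugsziele', 'in', 'der', 'nähe', 'von', 'keyword' ]

console.log(getTagsWithLookarounds('Äpfel, Öl & 3 Brötchen'))
// [ 'äpfel', 'öl', '3', 'brötchen' ]
```

Words that begin with an umlaut and standalone digits are both kept, while punctuation and symbols are dropped.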