include unicode character within long regex

Time:07-15

I have a regex:

/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġg̶̃čḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶]+/gm

which works great except there is one character I can't include (or that doesn't seem to work as expected when included). The character is (within) the last item in the character class:

ś̶ // [it makes the cross-through (not easily visible in some fonts); in Unicode it is 'COMBINING LONG STROKE OVERLAY' (U+0336)]

My regex is capturing the character but splitting any word that contains it:

"mokk̇ś̶ḣô".match(/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġčḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶g̶̃]+/gm)

// == ['mokk', 'ś̶ḣô']

I've heard about Unicode Property Escapes using \p{UnicodePropertyValue} with a u flag. Would that be useful here?

CodePudding user response:

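It doesn't seem to be related to the ś char. As you said yourself, it's being captured. The reason for the splitting is that another char is missing from the class: k̇. It is a plain k followed by a combining dot above, and that combining mark is not in your character class, so the match stops right after the k.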

console.log("mokk̇ś̶ḣô".match(/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġčḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶]+/gm))
console.log("mokk̇ś̶ḣô".match(/[a-zA-Zɑôáīúȑìêɑ͡iɑ͡uŋġḧn̐ƞġčḣñt́d́ŕŕńȶv̈m̈ᵯǰɏæǽÿẇẏs̃śś̶k̇g̶̃]+/gm))
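
As for Unicode property escapes: they could help here, because a character class works on individual code points, so every combining mark has to be a member of the class in its own right. Below is a minimal sketch, not a drop-in replacement for your regex. The sample word and the variable names are made up for illustration, and the word is written with explicit escapes so the code points are unambiguous. Note that \p{L} accepts every Unicode letter and \p{M} every combining mark, which is much broader than your hand-picked list.

```javascript
// Hypothetical sample word, written with escapes so the code points are explicit:
// "k" + U+0307 (COMBINING DOT ABOVE), then
// "s" + U+0301 (COMBINING ACUTE ACCENT) + U+0336 (COMBINING LONG STROKE OVERLAY).
const word = "mok\u0307s\u0301\u0336o";

// One visible glyph can be several code points: "k" plus the combining dot.
const codePoints = [..."k\u0307"].length; // 2

// A class that lists only base letters stops at every combining mark,
// so the word is split into pieces.
const lettersOnly = word.match(/[a-z]+/g); // ["mok", "s", "o"]

// With the u flag, \p{L} matches any letter and \p{M} any combining mark,
// so the whole word comes back as a single match.
const lettersAndMarks = word.match(/[\p{L}\p{M}]+/gu);

console.log(codePoints, lettersOnly, lettersAndMarks);
```

If you need to keep the strict whitelist, the alternative is what the second console.log above does: make sure every combining mark that can appear in your words is present somewhere inside the class.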
