RegExp in JavaScript: include one Unicode group and exclude another one

Time:01-30

I have to write a regexp, and I know there are many ways to do it. But I ran into something I don't understand, so I want to ask.

I had this regexp: /^[\p{Letter} \-'`’–]+$/u. It has to allow Latin letters, including symbols like Ü, Æ, etc.

And it has a bug: it also allows Ukrainian (Cyrillic) symbols such as "і" and "Ї". So I want to add an "exclude Cyrillic" rule, but I don't know how to do it.

I tried /^[\p{Letter} \-'`’–][^\p{sc=Cyrillic}]+$/u, but the last part, [^...], is a separate character class that means "any character except Cyrillic" ((

Please don't tell me to rewrite the regexp. I just want to learn how to write an "exclude" rule.

Thanks)

Answer:

In your /^[\p{Letter} \-'`’–]+$/u regex, \p{Letter} matches any character in the Unicode Letter general category. The \p{Alphabetic} property covers more characters (letter numbers and some combining marks as well), so it is the better choice if you plan to match any Unicode letter.
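To make the difference between \p{Letter} and \p{Alphabetic} concrete, here is a small sketch. The sample characters are my own picks, not from the question: a Roman numeral sign (a "letter number") and the Hebrew point qamats (a combining mark). It assumes an engine that supports Unicode property escapes (ES2018 or later).

```javascript
// Anchored so each regex must match the whole one-character sample.
const letter = /^\p{Letter}$/u;
const alphabetic = /^\p{Alphabetic}$/u;

// Illustrative samples (assumptions, not from the original post):
const romanEight = "\u2167"; // ROMAN NUMERAL EIGHT, general category Nl
const qamats = "\u05B8";     // HEBREW POINT QAMATS, a combining mark (Mn)

console.log(letter.test("Ü"), alphabetic.test("Ü"));               // => true true
console.log(letter.test(romanEight), alphabetic.test(romanEight)); // => false true
console.log(letter.test(qamats), alphabetic.test(qamats));         // => false true
```

This is why a pointed Hebrew name passes a \p{Alphabetic}-based class but would fail a \p{Letter}-based one.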

Since you only want to match Latin letters, replace \p{Letter} with \p{sc=Latin} or its short alias \p{sc=Latn}. (Note that the hyphen is best placed at the end of the character class, where it needs no escaping; that is the cleanest way to use it here.)

Note that the sc= (or Script=) prefix is required to use script names in Unicode property escapes. To match by script extensions instead, use scx= (or Script_Extensions=). See the ECMAScript reference.
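A short sketch of how sc= and scx= can differ, assuming an ES2018+ engine. The sample characters are my own choice for illustration: U+30FC (the Katakana-Hiragana prolonged sound mark) has Script=Common but lists Hiragana in its script extensions, and U+0301 (combining acute accent) has Script=Inherited, which matters for names typed in decomposed form.

```javascript
const mark = "\u30FC"; // "ー", shared by Hiragana and Katakana text

console.log(/^\p{sc=Hiragana}$/u.test(mark));  // => false (its Script is Common)
console.log(/^\p{scx=Hiragana}$/u.test(mark)); // => true  (Hiragana is in its extensions)

// A decomposed "é" is "e" followed by U+0301, whose Script is Inherited,
// so a Latin-only class rejects it until the string is normalized to NFC.
const latinOnly = /^[\p{sc=Latn}]+$/u;
const decomposed = "Chloe\u0301";

console.log(latinOnly.test(decomposed));                  // => false
console.log(latinOnly.test(decomposed.normalize("NFC"))); // => true
```

If input may arrive decomposed, normalizing with String.prototype.normalize("NFC") before testing is one way to keep a script-based class predictable.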

See a JavaScript demo:

const rx = /^[\p{sc=Latn} '`’–-]+$/u;
console.log( rx.test("Вася-Пупкин’о") ); // => false
console.log( rx.test("Łukasz Ąłski") );  // => true
console.log( rx.test("Chloé Alméras") ); // => true

If you want to match any letters except Cyrillic ones, you only need to add a negative lookahead, which is checked before each character is consumed:

const rx = /^(?:(?!\p{sc=Cyrl})[\p{Alphabetic} '`’–-])+$/u;
console.log( rx.test("Вася-Пупкин’о") ); // => false
console.log( rx.test("עֲדִינָה") );         // => true
console.log( rx.test("Łukasz Ąłski") );  // => true
console.log( rx.test("Chloé Alméras") ); // => true
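The same "exclude" pattern extends to several scripts by putting alternatives inside the lookahead. Below is a minimal sketch; makeNameRegex is a hypothetical helper name of mine, not part of the answer, and it assumes the script names passed in are valid Unicode script values or aliases such as Cyrl or Grek.

```javascript
// Hypothetical helper: builds a regex that allows the same character class
// as above but rejects any character belonging to one of the given scripts.
function makeNameRegex(excludedScripts) {
  const allowed = "[\\p{Alphabetic} '`’–-]";
  if (excludedScripts.length === 0) {
    // No exclusions: an empty lookahead (?!) would reject everything.
    return new RegExp("^" + allowed + "+$", "u");
  }
  // e.g. ["Cyrl", "Grek"] -> \p{sc=Cyrl}|\p{sc=Grek}
  const excluded = excludedScripts.map((s) => "\\p{sc=" + s + "}").join("|");
  return new RegExp("^(?:(?!" + excluded + ")" + allowed + ")+$", "u");
}

const noCyrillic = makeNameRegex(["Cyrl"]);
const noCyrillicOrGreek = makeNameRegex(["Cyrl", "Grek"]);

console.log(noCyrillic.test("Вася-Пупкин’о"));   // => false
console.log(noCyrillic.test("Łukasz Ąłski"));    // => true
console.log(noCyrillic.test("Αθηνα"));           // => true
console.log(noCyrillicOrGreek.test("Αθηνα"));    // => false
```

The lookahead consumes nothing; it only vetoes the next character, so the character class after it still decides what is allowed.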

See more about value aliases and canonical values for the Unicode properties (Script and Script_Extensions) here.
