How to combine multiple Unicode character properties in JavaScript?

Time:03-13

With the u flag, JavaScript regular expressions support Unicode property escapes such as \p{Script=Latin} (which can also be written as \p{sc=Latin}) and \p{Uppercase}.

But there is currently no way to select the intersection of multiple sets, as /(?[ \p{Script=Latin} & \p{Uppercase} ])/ does in Perl ≥5.18, or as a hypothetical \p{Script=Latin,Uppercase} would.

So the task is to find a workaround.

Example input:

const input = [
'License: GPL!',
'License: WÐFPL!',
'License: None!',
]

Example output: ['GPL', 'WÐFPL']

The answer could use a regexp that looks like this, for example: /^License:\s*(?<abbr>\p{Script=Latin,Uppercase}+)!$/u

CodePudding user response:

const input = [
'License: GPL!',
'License: WÐFPL!',
'License: None!',
]
const regexp = /^License:\s*(?<abbr>(?:(?![ƗØ])(?=\p{Uppercase})\p{sc=Latin})+)!$/u
console.log(input.map(str => str.match(regexp)?.groups?.abbr).filter(Boolean))

Explanation:

^
License:
\s*
(?<abbr>   // named capture group
    (?:
        // A negative look-ahead assertion.
        // Exclusion of Ɨ and Ø was not required by the question;
        // this line is here to provide more examples.
        (?![ƗØ])

        // A look-ahead assertion (looks into the future,
        // and then always goes back to the former position)
        (?=\p{Uppercase})

        \p{sc=Latin}
    )+  // one or more such characters
)
!
$
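
To see what each look-ahead contributes, here is a minimal sketch that applies the pattern above to a few strings. The helper name extractAbbr and the test string 'License: ØK!' are assumptions made for illustration; they are not part of the original answer.

```javascript
// Pattern from the answer above: each character must pass both look-aheads
// and then be consumed by \p{sc=Latin}.
const abbrPattern = /^License:\s*(?<abbr>(?:(?![ƗØ])(?=\p{Uppercase})\p{sc=Latin})+)!$/u

// Hypothetical helper: returns the abbreviation, or null when there is no match.
function extractAbbr(str) {
  return str.match(abbrPattern)?.groups?.abbr ?? null
}

console.log(extractAbbr('License: WÐFPL!')) // WÐFPL
console.log(extractAbbr('License: ØK!'))    // null (Ø is rejected by the negative look-ahead)
console.log(extractAbbr('License: None!'))  // null ("o" is Latin but not uppercase)
```

Because look-aheads consume nothing, any number of them can be stacked in front of the single consuming \p{sc=Latin}, which is what makes this approach work for arbitrary conditions and not only for predefined properties.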

CodePudding user response:

There is no ideal general workaround, except when you want the intersection of predefined character classes. In that case, all you have to do is put the negated properties inside a negated character class:

^License:\s*([^\P{Script=Latin}\P{Uppercase}]+)!

It is simple set logic (De Morgan's law): A ∩ B = !!(A ∩ B) = !(!A ∪ !B)
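
As a sketch of how this plays out on the example input from the question: the variable names, the named group abbr (borrowed from the question) and the trailing $ anchor are assumptions added here, while the character class itself is the one from this answer.

```javascript
const licenses = [
  'License: GPL!',
  'License: WÐFPL!',
  'License: None!',
]

// [^\P{A}\P{B}] reads as "not (not-A or not-B)", which is "A and B".
const latinUpper = /^License:\s*(?<abbr>[^\P{Script=Latin}\P{Uppercase}]+)!$/u

const abbrs = licenses
  .map(str => str.match(latinUpper)?.groups?.abbr)
  .filter(Boolean) // drop the undefined entries from non-matching lines

console.log(abbrs) // [ 'GPL', 'WÐFPL' ]
```

The u flag is required here: without it, \P{...} is not treated as a property escape.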
