JavaScript regex validation for non-Latin characters with a small symbol whitelist

Time:09-14

I'm trying to create validation rules for a username in two steps:

  1. Detect whether the string contains any non-Latin characters. All non-alphabetic symbols, numbers and whitespace are allowed.
  2. Detect whether the string contains any symbols that are not in the whitelist (' - _ `). All Latin and non-Latin characters, numbers and whitespace are allowed.

I thought it would be easy, but I was wrong...

  1. For the first case I've tried to remove latin characters/numbers/whitespaces from the string:

str.replace(/[A-Za-z0-9\s]/g, '')

With this rule, from "Xxx z 88A ююю 4$??!!" I get "ююю$??!!". But how do I also remove all the symbols, so that only "ююю" stays?

  2. For the second case I've tried to remove Latin characters, numbers, whitespace and the whitelisted symbols (' - _ `) with str.replace(/[A-Za-z0-9-_`\s]/g, ''), but I don't know how to remove non-Latin characters.

Summary: My main problem is to detect non latin characters and separate them from special symbols.

UPDATE: Ok, for my second case I can use:

str.replace(/[\u0250-\ue007]/g, '').replace(/[A-Za-z0-9-_`\s]/g, '')

It works, but looks dirty... Pardon for backticks.

CodePudding user response:

For the first problem, eliminating a-z, 0-9, whitespace, symbols and punctuation, you need to know some Unicode tricks.

  1. You can reference Unicode character categories with \p{...} property escapes. Symbols are \p{S}, punctuation is \p{P}.

  2. To use this magic, you need to add the u flag to the regex.

That gives us:

/([a-z0-9]|\s|\p{S}|\p{P})/giu

(I added the i because then I don't have to write A-Z as well as a-z.)
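
As a rough sketch of how that regex could be used for the first step, here it is applied to the sample string from the question. The helper names stripLatinAndSymbols and containsNonLatin are made up for illustration, and treating "anything left over" as non-Latin is an assumption: combining marks and some other categories would also survive the replace.

```javascript
// Regex taken from the answer above; the helper names are hypothetical.
const latinDigitsSymbols = /([a-z0-9]|\s|\p{S}|\p{P})/giu;

// Remove Latin letters, digits, whitespace, symbols and punctuation.
function stripLatinAndSymbols(str) {
  return str.replace(latinDigitsSymbols, '');
}

// Assumption: whatever survives the replace counts as "non-Latin".
function containsNonLatin(str) {
  return stripLatinAndSymbols(str).length > 0;
}

console.log(stripLatinAndSymbols('Xxx z 88A ююю 4$??!!')); // "ююю"
console.log(containsNonLatin('John_Doe-42'));              // false
```

Note that \p{P} and \p{S} also cover the whitelisted characters (' - _ `), so this only fits the first step, not the second.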

Since you have a solution for your second problem, I'll leave that with you.

CodePudding user response:

The two cases could be solved as follows ...

  • The first case boils down to ... "allow just non-Latin letters" ... which could be achieved by ...

    • removing any non-letter character sequence ... /[^\p{L}]+/gu
    • and then removing any Latin character sequence ... /[a-zA-Z]+/g
  • The second case allows "just any letter, number and whitespace as well as underscore and dash" ... which is best achieved by ...

    • removing any character sequence which contains neither letter/\p{L} nor number/\p{N} nor whitespace/\p{Z} nor underscore nor dash ... /[^\p{L}\p{N}\p{Z}_-]+/gu

In addition the OP could read about regex unicode escapes.

const testSample = 'Xxx z_88A-ююю 4$??!!';

console.log(
  '1st case ... allow just non latin letters ...', {
    testSample,
    result: testSample
      // remove any non letter character sequence ...
      .replace(/[^\p{L}]+/gu, '')
      // ... then remove any latin character sequence.
      .replace(/[a-zA-Z]+/g, ''),
  },
);
console.log(
  '2nd case ... allow any letter, number and whitespace as well as underscore and dash ...', {
    testSample,
    result: testSample
      // remove any character sequence which contains neither letter/`\p{L}`
      // nor number/`\p{N}` nor whitespace/`\p{Z}` nor underscore nor dash.
      .replace(/[^\p{L}\p{N}\p{Z}_-]+/gu, ''),
  },
);
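
Since the goal is validation rather than cleanup, the same character classes could also be used with test() instead of replace(). The following is only a sketch: the function names are hypothetical, and adding the apostrophe and backtick from the question's whitelist to the class is an assumption about what the OP wants.

```javascript
// Hypothetical helper: true if the string holds a letter outside a-z / A-Z.
// The lookahead rejects Latin letters before \p{L} gets to match.
function hasNonLatinLetter(name) {
  return /(?![a-zA-Z])\p{L}/u.test(name);
}

// Hypothetical helper: true if the string holds a character that is not a
// letter, number, whitespace (\p{Z}) or one of the whitelisted symbols.
// Assumption: ' and ` are added here because the question lists them.
function hasForbiddenSymbol(name) {
  return /[^\p{L}\p{N}\p{Z}'`_-]/u.test(name);
}

const testSample = 'Xxx z_88A-ююю 4$??!!';

console.log(hasNonLatinLetter(testSample));        // true  (ююю)
console.log(hasNonLatinLetter('John Doe 42'));     // false
console.log(hasForbiddenSymbol(testSample));       // true  ($, ?, !)
console.log(hasForbiddenSymbol("O'Neil_ююю-42"));  // false
```

Both regexes are written without the g flag, so test() does not carry a lastIndex between calls.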

CodePudding user response:

So instead of matching the "forbidden" characters by specifying them individually or as a range, you could simply invert the match of the allowed characters:

For case one this would be (as I understood it)

[^A-Za-z0-9,.%$^#@$_-]

That little ^ as first character of the character class (inside the []) inverts the rest of the character class, meaning: match anything except those characters.

Just make sure to keep the - as last character inside the character class when you want to match/not match literally that one and don't define a range.
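
A small sketch of both points, the negated class and the position of the dash. The allowed set used here (Latin letters, digits, underscore, dash) and the name hasDisallowed are only examples, not the OP's actual rule.

```javascript
// Negated class: matches any single character that is NOT in the allowed set.
// The dash is last, so it is a literal dash and not a range operator.
const disallowed = /[^A-Za-z0-9_-]/;

// Hypothetical helper: true if the string holds a character outside the set.
function hasDisallowed(str) {
  return disallowed.test(str);
}

console.log(hasDisallowed('user_name-1')); // false
console.log(hasDisallowed('user name'));   // true (space is not in the set)

// With the dash in the middle, "#-_" becomes a range from "#" (0x23)
// to "_" (0x5F), which happens to include the uppercase letters.
console.log(/[#-_]/.test('A')); // true
console.log(/[#_-]/.test('A')); // false
```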

And for case two you could similarly specify only the allowed characters. Unfortunately I did not really understand what you meant by "whitelist" and where you want to remove or keep what.
