I am currently working on checking whether a string contains accented characters. For example:
hellohello ---> return true
helloéèhello ---> return false because the text contains accented characters
Can anyone help me with the regex?
Thank you.
CodePudding user response:
If you are trying to validate emails, check out the answer that alex suggested, but if you just want to handle the two test cases above, here it is.
Note that this does not test for a valid email, only for a string made of allowed characters.
^[a-zA-Z0-9@._]*[a-zA-Z0-9]$
^ starting with
[a-zA-Z0-9@._] lowercase, uppercase, digits, @, . and _ are valid characters, meaning accents and other symbols are not valid
* greedily matches zero or more characters from the preceding set
[a-zA-Z0-9]$ ends with an alphanumeric character
Example
hello@gmail.com ---> return true
helloégmail ---> return false because the text contains accented characters
hello1 ---> return true
test@ ---> return false because it should end with an alphanumeric character
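As a quick sanity check, the examples above can be run against the pattern like this. It is only a minimal sketch: it assumes digits are allowed in the first character class (as the explanation says), and the names plainPattern and isPlainString are made up for illustration.

```javascript
// Hypothetical helper names. Assumption: digits are allowed in the first
// character class, and the last character must be alphanumeric.
const plainPattern = /^[a-zA-Z0-9@._]*[a-zA-Z0-9]$/;

function isPlainString( text ){
  return plainPattern.test( text );
}

// Print the result for each of the example strings.
for( const sample of [ 'hello@gmail.com', 'helloégmail', 'hello1', 'test@' ] ){
  console.log( sample, '--->', isPlainString( sample ) );
}
```

Note that an empty string is rejected too, because the pattern requires at least one alphanumeric character at the end.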
CodePudding user response:
I am not sure why you want a regex-based answer, but if that is not absolutely necessary, here is how you can detect it.
(Disclaimer: I am not familiar with European languages that have accented alphabets, so I may have missed some linguistic aspect here. Also, I am more familiar with Java and the JavaScript here may not be optimal.)
ASCII
If your text is in a single-byte encoding (strictly, ASCII covers only 0 to 127; the accented letters sit in the extended Latin-1 range up to 255), then I know no other way than looping through the characters and comparing each character code against the set of accented characters. You can loop from 1 to 255 and print the characters to see them.
The accented characters, as far as I can see, start from 192 onwards. However, not all characters beyond this are accented (for example, 215 is × and 247 is ÷), so you will have to compare against the right set.
Here is a sketch that shows what I mean. (I am not skilled at JavaScript.)
/* This array has to be prepared by looking at all characters up to code 255. */
const accented = [ 192, 193 /* , ... */ ];
for( let c of Array.from( 'helloéèhello' ) ){
  if( accented.includes( c.charCodeAt( 0 ) ) ){
    console.log( "Accented chars present" );
    break;
  }
}
Unicode
If this is Unicode text, there is an indirect way to do this using Unicode normalization. In Unicode, accented characters are usually composite characters: a base letter plus one or more combining marks. So, you can decompose the text and check whether it contains a combining diacritical mark (code points 768 to 879, that is U+0300 to U+036F).
To understand it in detail, you may go through the description at https://www.unicode.org/reports/tr15/tr15-23.html.
This is not perfect, but will be a good guide for you to come up with a more complete design.
Decomposing in JavaScript:
'helloéèhello'.normalize( 'NFD' )
E.g., é decomposes into e and code point 769 (U+0301, combining acute accent), and è decomposes into e and code point 768 (U+0300, combining grave accent).
Note the difference in the characters with and without normalization.
Array.from( 'helloéèhello'.normalize( 'NFD' ) )
(14) ['h', 'e', 'l', 'l', 'o', 'e', '́', 'e', '̀', 'h', 'e', 'l', 'l', 'o']
Array.from( 'helloéèhello' )
(12) ['h', 'e', 'l', 'l', 'o', 'é', 'è', 'h', 'e', 'l', 'l', 'o']
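Putting the decomposition to use, a check could look roughly like the following. This is a minimal sketch, not a complete design: the function name hasAccent is made up, and it assumes that "accented" means "contains a mark from the Combining Diacritical Marks block (U+0300 to U+036F) after NFD normalization".

```javascript
// Hypothetical helper: true if the text contains a combining diacritical
// mark (U+0300 to U+036F) once it has been canonically decomposed (NFD).
function hasAccent( text ){
  return /[\u0300-\u036f]/.test( text.normalize( 'NFD' ) );
}

console.log( hasAccent( 'hellohello' ) );    // no combining marks after NFD
console.log( hasAccent( 'helloéèhello' ) );  // é and è each yield a combining mark
```

To get the result the question asks for (true when there are no accents), negate it: !hasAccent( text ). Letters such as ø or ß have no canonical decomposition, so this check does not flag them.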