How do I parse each letter of an Arabic text string in .NET C#?

Time:11-25

When I loop over every char of the .NET C# string "Arabic text: ٻڠڣڟگگښڏ", why do I get the wrong letter at position 13? I get 'ٻ' instead of 'ڏ'.

How do I fix it?


CodePudding user response:

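Arabic is written right-to-left, so the letter stored first appears at the right-hand end of the Arabic text. The arrow points to the character at offset 20.

You're pointing to the last character of the string, not the one at offset 13.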

 0: U+0041 LATIN CAPITAL LETTER A
 1: U+0072 LATIN SMALL LETTER R
 2: U+0061 LATIN SMALL LETTER A
 3: U+0062 LATIN SMALL LETTER B
 4: U+0069 LATIN SMALL LETTER I
 5: U+0063 LATIN SMALL LETTER C
 6: U+0020 SPACE
 7: U+0074 LATIN SMALL LETTER T
 8: U+0065 LATIN SMALL LETTER E
 9: U+0078 LATIN SMALL LETTER X
10: U+0074 LATIN SMALL LETTER T
11: U+003A COLON
12: U+0020 SPACE
13: U+067B ARABIC LETTER BEEH
14: U+06A0 ARABIC LETTER AIN WITH THREE DOTS ABOVE
15: U+06A3 ARABIC LETTER FEH WITH DOT BELOW
16: U+069F ARABIC LETTER TAH WITH THREE DOTS ABOVE
17: U+06AF ARABIC LETTER GAF
18: U+06AF ARABIC LETTER GAF
19: U+069A ARABIC LETTER SEEN WITH DOT BELOW AND DOT ABOVE
20: U+068F ARABIC LETTER DAL WITH THREE DOTS ABOVE DOWNWARDS
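
A listing of offsets and code units like the one above can be produced with a short loop. This is only a sketch: the class name CharDump and the method Describe are made up for illustration, and the string is written with \u escapes so that bidirectional rendering in an editor cannot reorder it. It prints code unit values only; the character names would need a separate Unicode data source.

```csharp
using System;
using System.Collections.Generic;

static class CharDump
{
    // Hypothetical helper: returns one "offset: U+XXXX" line per UTF-16
    // code unit, in logical (storage) order, not display order.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            lines.Add($"{i,2}: U+{(int)text[i]:X4}");
        }
        return lines;
    }

    public static void Main()
    {
        // "Arabic text: " followed by the eight Arabic letters from the question.
        string text = "Arabic text: \u067B\u06A0\u06A3\u069F\u06AF\u06AF\u069A\u068F";
        foreach (string line in Describe(text))
        {
            Console.WriteLine(line);
        }
    }
}
```

Offset 13 holds U+067B and offset 20 holds U+068F, whichever way the text is drawn on screen.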

And that's not going into the fact that a grapheme (visual element) can be composed from multiple Unicode Code Points, and that C# uses surrogate pairs and thus multiple char values to represent some Unicode Code Points.

For example, there is a script that has the following grapheme: a glyph with a dot below it.

  • The grapheme is formed from the Unicode Code Points U+11A0B followed by U+11A33.
  • C# encodes U+11A0B as chars 0xD806 followed by 0xDE0B.
  • C# encodes U+11A33 as chars 0xD806 followed by 0xDE33.
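
The surrogate values in the list can be checked by hand with the standard UTF-16 arithmetic. The sketch below does that; SurrogateDemo and ToSurrogates are hypothetical names for illustration, and the result is cross-checked against the framework's char.ConvertFromUtf32. It assumes a compiler with tuple support (C# 7 or later).

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: splits a supplementary code point (above U+FFFF)
    // into its UTF-16 high and low surrogate.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        foreach (int cp in new[] { 0x11A0B, 0x11A33 })
        {
            var (high, low) = ToSurrogates(cp);
            // Cross-check against the framework's own conversion.
            string viaFramework = char.ConvertFromUtf32(cp);
            bool agrees = viaFramework[0] == high && viaFramework[1] == low;
            Console.WriteLine($"U+{cp:X}: 0x{(int)high:X4} 0x{(int)low:X4} (framework agrees: {agrees})");
        }
    }
}
```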

So the grapheme would be represented by the following sequence of four char values!

  1. 0xD806
  2. 0xDE0B
  3. 0xD806
  4. 0xDE33
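
To walk a string by code point or by grapheme rather than by char, something along these lines could be used. It is a sketch, not a definitive implementation: TextWalker, CodePoints and TextElements are made-up names, while char.IsSurrogatePair, char.ConvertToUtf32 and StringInfo.GetTextElementEnumerator are existing framework calls. How StringInfo groups code points into text elements depends on the .NET version, so check the text element count on your own runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextWalker
{
    // Yields Unicode code points, combining each surrogate pair into one value.
    // An unpaired surrogate is passed through as its raw code unit value.
    public static IEnumerable<int> CodePoints(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate
            }
            else
            {
                yield return text[i];
            }
        }
    }

    // Yields text elements, the framework's notion of a grapheme.
    public static IEnumerable<string> TextElements(string text)
    {
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            yield return e.GetTextElement();
        }
    }

    public static void Main()
    {
        // The four char values listed above: U+11A0B followed by U+11A33.
        string text = "\uD806\uDE0B\uD806\uDE33";
        Console.WriteLine($"char values: {text.Length}");
        foreach (int cp in CodePoints(text))
            Console.WriteLine($"code point U+{cp:X}");
        foreach (string element in TextElements(text))
            Console.WriteLine($"text element made of {element.Length} char value(s)");
    }
}
```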

And no, it's not just for archaic languages.
