How to make a char array that doesn't separate diacritics?

I'm trying to separate a Hebrew word into letters in C#, but ToCharArray() separates the diacritics as if they were separate letters (which they're not). I'm fine with either keeping the letters whole with their diacritics or, worst case, getting rid of the diacritics altogether.

Example: כֶּלֶב is coming out as 6 different letters

CodePudding user response：

The StringInfo class knows about base characters and accents and can handle this:

string s = "כֶּלֶב";
System.Globalization.TextElementEnumerator charEnum = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (charEnum.MoveNext())
{
    Console.WriteLine(charEnum.GetTextElement());
}

will print 3 lines:

כֶּ
לֶ
ב
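If you need the letters as a collection rather than printed lines, the same enumerator can fill a list. This is only a minimal sketch: SplitTextElements is a hypothetical helper name, not part of the framework, and the word is written with \u escapes purely so the code does not depend on the file's encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (name is an assumption): returns each text element,
// i.e. a base character together with its combining marks, as one string.
static List<string> SplitTextElements(string text)
{
    var elements = new List<string>();
    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
    while (enumerator.MoveNext())
    {
        elements.Add(enumerator.GetTextElement());
    }
    return elements;
}

// Same word as in the question: kaf + dagesh + segol, lamed + segol, bet.
string hebrewWord = "\u05DB\u05BC\u05B6\u05DC\u05B6\u05D1";
List<string> letters = SplitTextElements(hebrewWord);

// Should report 3 elements, matching the three lines printed above.
Console.WriteLine(letters.Count);
```

Joining the list back together with string.Concat(letters) should give the original string, since the enumerator only groups chars and never drops any.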

CodePudding user response：

Strings in C# are sequences of char, and each char is one UTF-16 code unit. ToCharArray() just returns those code units. A single "symbol" as the reader sees it can take several of them; here each base letter is followed by one or more combining marks, and every mark is its own char.

Would char.GetUnicodeCategory(char) be of any help? Maybe you could split that array on OtherLetter or something (not familiar with Hebrew)?

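// Needs: using System; using System.Linq;
const string word = "כֶּלֶב";
Console.WriteLine(word.Length);  // number of UTF-16 code units
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(x => (int)x)));  // code unit values
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(char.GetUnicodeCategory)));  // Unicode category of each char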

Output:

6
1499 1468 1462 1500 1462 1489
OtherLetter NonSpacingMark NonSpacingMark OtherLetter NonSpacingMark OtherLetter
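Going by that output, one way to act on the categories could look like the sketch below. It makes two assumptions: every diacritic in your text is reported as NonSpacingMark (true for the word above, not checked for all of Hebrew), and there are no surrogate pairs. StripMarks and GroupByBaseLetter are hypothetical helper names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper: drops every char categorized as NonSpacingMark.
static string StripMarks(string text) =>
    new string(text.Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark).ToArray());

// Hypothetical helper: starts a new group at each char that is not a mark,
// so the marks stay attached to the letter that precedes them.
static List<string> GroupByBaseLetter(string text)
{
    var groups = new List<string>();
    var current = new StringBuilder();
    foreach (char c in text)
    {
        bool isMark = char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;
        if (!isMark && current.Length > 0)
        {
            groups.Add(current.ToString());
            current.Clear();
        }
        current.Append(c);
    }
    if (current.Length > 0)
    {
        groups.Add(current.ToString());
    }
    return groups;
}

// Same six code units as in the output above: 1499 1468 1462 1500 1462 1489.
string pointedWord = "\u05DB\u05BC\u05B6\u05DC\u05B6\u05D1";
string bareWord = StripMarks(pointedWord);
List<string> grouped = GroupByBaseLetter(pointedWord);

Console.WriteLine(bareWord.Length);  // 3 letters left once the marks are removed
Console.WriteLine(grouped.Count);    // 3 groups, each a letter plus its marks
```

StripMarks covers the "worst case" from the question (no diacritics at all), while GroupByBaseLetter keeps them attached. For anything beyond this simple case, the StringInfo approach from the other answer is probably the safer choice.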