I'm trying to figure out how to handle filenames in Tamil. I need to shorten them like this: "foobar.gif" -> "foo...gif".
I've learned today that some languages use more than one char to represent a letter and I discovered that C# has the Rune concept.
I can't get this to work with Tamil.
Take "தமிழ்.gif" for example:
I had hoped that "தமிழ்.gif".Length would be 6, but it's 9.
How can I get a proper substring, so that "தமிழ்.gif".Substring(0, 2) gives "தமி" instead of "தம"?
What am I missing?
CodePudding user response:
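This is the same kind of problem as surrogate pairs, which are pairs of char that together represent a "single" character in Unicode.
In your Tamil example the extra chars are combining marks rather than surrogates: த, ம, ி, ழ and ் are five separate chars that render as three letters. That is also why Rune alone doesn't help here, but the consequence is the same: one visible character spans several char values.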
See these questions regarding surrogate pairs:
What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?
Is String.Replace(string,string) Unicode Safe in regards to Surrogate Pairs?
When dealing with characters that are actually longer than a single char, you'll have to find the indices at which each complete character (a "text element") starts within the string, and only cut the string at those indices.
I should add that, because of this, you'll have to create some "Unicode-safe" methods for removing characters or finding indices. Otherwise you may end up removing "half" of a valid Unicode character and be left with invalid Unicode, or, with combining marks, with a different letter than the one you started with.
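As a rough sketch of what such a "Unicode-safe" method could look like, the snippet below uses System.Globalization.StringInfo, which splits a string into text elements. TakeTextElements and ShortenFileName are hypothetical helper names made up for this example, the ".." separator is only chosen to match the "foo...gif" format from the question, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: returns the first `count` text elements of `s`.
string TakeTextElements(string s, int count)
{
    if (count <= 0) return string.Empty;
    // Char index at which each text element (base char + combining marks) starts.
    int[] starts = StringInfo.ParseCombiningCharacters(s);
    if (count >= starts.Length) return s;
    // Cutting at the start of the next element never splits a character.
    return s.Substring(0, starts[count]);
}

// Hypothetical helper: keeps `keep` text elements of the name, then "..", then the extension.
string ShortenFileName(string fileName, int keep)
{
    int dot = fileName.LastIndexOf('.');
    string name = dot < 0 ? fileName : fileName.Substring(0, dot);
    string ext = dot < 0 ? string.Empty : fileName.Substring(dot); // includes the "."
    // Nothing to shorten if the name already fits.
    if (new StringInfo(name).LengthInTextElements <= keep) return fileName;
    return TakeTextElements(name, keep) + ".." + ext;
}

Console.WriteLine("தமிழ்.gif".Length);                                  // 9 (chars)
Console.WriteLine(new StringInfo("தமிழ்.gif").LengthInTextElements);    // 7 (3 letters + ".gif")
Console.WriteLine(TakeTextElements("தமிழ்.gif", 2));                    // தமி
Console.WriteLine(ShortenFileName("foobar.gif", 3));                    // foo...gif
Console.WriteLine(ShortenFileName("தமிழ்.gif", 2));                     // தமி...gif
```

Note that the text-element count for the whole file name comes out as 7, not 6, because "தமிழ்" is three letters (த, மி, ழ்) followed by the four chars of ".gif".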