I'm processing UTF-8 strings and I would like to design a class, or find some other way, to prevent string manipulation.
String manipulation is undesirable for strings of multibyte characters, because splitting the string at an arbitrary position (which is measured in bytes) may split a character halfway.
I have thought about using const std::string&, but the user/developer can still create a substring by calling std::string::substr.
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
Is this even possible?
CodePudding user response:
"Another way would be to create a wrapper around const std::string& and expose the string only through getters."
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide whatever interface you see fit to operate on Unicode code points or characters instead of letting callers modify the storage directly.
However, there is nothing in the standard library that will help you implement this, so a better approach would be to use a third-party library that already does it for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters, or whatever else is suitable), the implementation is going to be a project in itself.
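As a rough illustration of the "reasonably simple" code point part (the function name is made up, and it assumes the input is already valid UTF-8), counting code points amounts to skipping continuation bytes:

```cpp
#include <cstddef>
#include <string>

// Count code points in a valid UTF-8 string by counting every byte that is
// not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```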
CodePudding user response:
I would use a wrapper whose external interface provides access either to code points or to characters. So foo.substr(3, 4) (for example) would skip the first 3 code points and give you the next 4 code points; alternatively, it would skip the first 3 characters and give you the next 4 characters. Either way, the result is independent of the number of bytes used to represent those code points or characters.
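A minimal sketch of the code point variant, assuming valid UTF-8 input (the helper names are made up for illustration; in a real design they would be member functions of the wrapper, so that foo.substr(3, 4) works as described):

```cpp
#include <cstddef>
#include <string>

// Return the byte index reached after advancing `n` code points from `from`.
// Assumes valid UTF-8; continuation bytes have the form 10xxxxxx.
std::size_t advance_code_points(const std::string& s, std::size_t from, std::size_t n) {
    std::size_t i = from;
    while (n > 0 && i < s.size()) {
        ++i;  // step past the lead byte of the current code point
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;  // skip its continuation bytes
        --n;
    }
    return i;
}

// substr measured in code points: skip `pos` code points, copy the next `count`.
std::string substr_code_points(const std::string& s, std::size_t pos, std::size_t count) {
    const std::size_t begin = advance_code_points(s, 0, pos);
    const std::size_t end = advance_code_points(s, begin, count);
    return s.substr(begin, end - begin);
}
```

A character-oriented substr would need grapheme-cluster segmentation on top of this, which is the part best left to a library.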
Quick aside for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 2^20 - 1. UTF-8 encodes a code point number as a sequence of 1 to 4 bytes.
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à would normally be encoded as an a followed by a combining grave accent (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a takes a single byte and the U+0300 takes two bytes. So it's one character, composed of two code points, encoded in three bytes.
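For concreteness, a small sketch that prints the bytes of that decomposed à (the hex escapes spell out the UTF-8 sequence directly):

```cpp
#include <cstdio>
#include <string>

int main() {
    // 'a' (0x61) followed by U+0300 COMBINING GRAVE ACCENT (0xCC 0x80 in UTF-8).
    const std::string a_grave = "a\xCC\x80";

    for (unsigned char byte : a_grave)
        std::printf("%02x ", byte);   // prints: 61 cc 80
    std::printf("\n");
}
```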
That's not quite all there is to characters (as opposed to code points), but it's sufficient for quite a few languages (especially for all the typical Western European languages).