Why is an array of characters (char type) working with Unicode characters (C++)?


When I write this code:

#include <iostream>

using namespace std;

int main(){
    char x[] = "γεια σας";
    cout << x;
    return 0;
}

I noticed that the compiler gave me the output I expected, γεια σας, even though the type of the array is char, which I thought should only accept ASCII characters.

So why didn't the compiler give an error?

CodePudding user response:

Here's some code showing what C++ really does:

#include <iostream>
#include <iomanip>
#include <cstring>

using namespace std;

int main(){
    char x[] = "γεια σας";
    cout << x << endl;
    
    auto len = strlen(x);
    cout << "Length (in bytes): " << len << endl;
    for (size_t i = 0; i < len; i++)
        cout << "0x" << setw(2) << hex << static_cast<int>(static_cast<unsigned char>(x[i])) << ' ';
    cout << endl;
    return 0;
}

The output is:

γεια σας
Length (in bytes): 15
0xce 0xb3 0xce 0xb5 0xce 0xb9 0xce 0xb1 0x20 0xcf 0x83 0xce 0xb1 0xcf 0x82 

So the string takes up 15 bytes and is encoded as UTF-8. UTF-8 is a Unicode encoding that uses between 1 and 4 bytes per code point (roughly, the smallest unit you can select with the text cursor). UTF-8 can be stored in a char array: even though the type is called char, it basically corresponds to a byte, not what we typically think of as a character.
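
As a hedged illustration, one way to count code points rather than bytes is to skip the UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx (the helper name countCodePoints is just for this sketch):

#include <iostream>
#include <cstring>

using namespace std;

// Minimal sketch: count UTF-8 code points by ignoring continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
size_t countCodePoints(const char* s) {
    size_t count = 0;
    size_t len = strlen(s);
    for (size_t i = 0; i < len; ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80)   // not a continuation byte
            ++count;
    }
    return count;
}

int main() {
    char x[] = "γεια σας";
    cout << "Bytes: " << strlen(x) << endl;                   // 15
    cout << "Code points: " << countCodePoints(x) << endl;    // 8
    return 0;
}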

CodePudding user response:

What you have got with 99.99% likelihood is Unicode code points stored in UTF-8 format. Each code point is turned into one to four chars.

Unicode in the ASCII range is turned into one byte from 0x00 to 0x7F. 2048 code points can be represented with two bytes using the binary pattern 110x xxxx 10yy yyyy, 65536 can be represented with three bytes (1110 xxxx 10yy yyyy 10zz zzzz), and the rest become four bytes (1111 0xxx 10yy yyyy 10zz zzzz 10uu uuuu).
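
A hedged sketch of those patterns in code, encoding a single code point into UTF-8 bytes by hand (the function name encodeUtf8 is purely illustrative):

#include <iostream>
#include <string>

using namespace std;

// Minimal sketch: encode one Unicode code point into UTF-8 following the
// patterns above. Assumes a valid code point no larger than 0x10FFFF.
string encodeUtf8(char32_t cp) {
    string out;
    if (cp <= 0x7F) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 2 bytes: 110xxxxx 10yyyyyy
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 3 bytes: 1110xxxx 10yyyyyy 10zzzzzz
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 4 bytes: 11110xxx 10yyyyyy 10zzzzzz 10uuuuuu
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    // U+03B3 (Greek small letter gamma) -> 0xce 0xb3, matching the dump above
    for (unsigned char b : encodeUtf8(0x3B3))
        cout << "0x" << hex << static_cast<int>(b) << ' ';
    cout << endl;
    return 0;
}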

Most C and C++ string functions work just fine with UTF-8. Exceptions are strncpy and strncat, which can cut a multi-byte code point in half. The old interview problem "reverse the characters in a string" becomes more complicated, because reversing the bytes inside a code point produces nonsense.
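
A hedged sketch of a UTF-8-aware reversal (the helper name reverseUtf8 is illustrative): it reverses the order of code points while keeping the bytes inside each code point together, which a plain byte-wise reverse would not.

#include <iostream>
#include <string>
#include <vector>

using namespace std;

// Minimal sketch: reverse a UTF-8 string code point by code point.
// Assumes valid UTF-8 input; a byte-wise reverse would scramble the
// multi-byte sequences instead.
string reverseUtf8(const string& s) {
    vector<string> codePoints;
    for (size_t i = 0; i < s.size(); ) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        size_t len = 1;                        // default: single ASCII byte
        if      ((b & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else if ((b & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((b & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        codePoints.push_back(s.substr(i, len));
        i += len;
    }
    string out;
    for (auto it = codePoints.rbegin(); it != codePoints.rend(); ++it)
        out += *it;
    return out;
}

int main() {
    cout << reverseUtf8("γεια σας") << endl;   // prints "ςασ αιεγ"
    return 0;
}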

CodePudding user response:

Although the type of the array is char, that is, it should just accept ASCII characters.

You've assumed wrongly.

Unicode has several transformation formats. One popular such format is UTF-8. The code units of UTF-8 are 8 bits wide, as implied by the name. It is always possible to use char to represent the code units of UTF-8, because char is guaranteed to be at least 8 bits wide.
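
A small hedged check of that guarantee, assuming only standard headers and a UTF-8 source file:

#include <climits>
#include <iostream>

using namespace std;

// char must be at least 8 bits wide, so every UTF-8 code unit (one byte)
// fits in a char. On mainstream platforms CHAR_BIT is exactly 8.
static_assert(CHAR_BIT >= 8, "char is always at least 8 bits wide");

int main() {
    char x[] = "γεια σας";   // stored as UTF-8 bytes in a plain char array
    cout << "CHAR_BIT = " << CHAR_BIT << endl;
    cout << "sizeof(x) = " << sizeof(x) << endl;   // 16: 15 UTF-8 bytes + the terminating 0
    return 0;
}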
