Why does codecvt_utf8 output show ffffff prepended to each hex value?

Time:10-27

For this code:

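#include <codecvt>
#include <iomanip>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

int main()
{
    std::wstring wstr = L"é";
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;

    std::stringstream ss;
    ss << std::hex << std::setfill('0');

    // Print each UTF-8 byte as two hex digits.
    for (auto c : myconv.to_bytes(wstr))
    {
        ss << std::setw(2) << static_cast<unsigned>(c);
    }
    std::string ssss = ss.str();
    std::cout << "ssss = " << ssss << std::endl;
}
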
Why does this print ffffffc3ffffffa9 instead of c3a9?

Why is ffffff prepended to each byte? You can run it on ideone: https://ideone.com/qZtGom

CodePudding user response:

c is of type char, which is signed on most systems. Converting a negative char to unsigned sign-extends the value, so the upper bits are filled with 1s.

Examples:

  • char(0x23) aka 35 --> unsigned(0x00000023)
  • char(0x80) aka -128 --> unsigned(0xFFFFFF80)
  • char(0xC3) aka -61 --> unsigned(0xFFFFFFC3)

[edit: My first suggestion didn't work; removed]

You can cast it twice: ss << std::setw(2) << static_cast<int>(static_cast<unsigned char>(c));

The first cast gives you an unsigned type with the same bit pattern, and since unsigned char is the same size as char, there is no sign extension.

But if you output only static_cast<unsigned char>(c), the stream treats it as a character and writes the raw byte instead of its numeric value, so what you see depends on your locale and terminal.

The second cast gives you an int, which the stream will output correctly.
