Why can't codecvt convert Unicode outside the BMP to u16string?


I am trying to understand Unicode handling in C++, and this has me confused.

Code:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
using namespace std;

void trial1(){
    string a = "\U00010000z";
    cout << a << endl;
    u16string b;
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    b = converter.from_bytes(a); // throws: codecvt_utf8<char16_t> only handles UCS-2, and U+10000 is outside the BMP
    u16string c = b.substr(0, 1);
    string q = converter.to_bytes(c);
    cout << q << endl;
}

void trial2(){
    u16string a = u"\U00010000";
    cout << a.length() << endl; // 2
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    string b = converter.to_bytes(a); // throws: the surrogate pair for U+10000 is not valid UCS-2
}

int main() {
//    both don't work
//    trial1();
//    trial2();
    return 0;
}

I have tested that u16string can store Unicode outside the BMP as surrogate pairs; e.g. u"\U00010000" is stored as 2 char16_t.

So why does std::wstring_convert<codecvt_utf8<char16_t>, char16_t> fail in both trial1 and trial2 and throw an exception?

CodePudding user response:

std::codecvt_utf8 does not support conversions to/from UTF-16, only UCS-2 and UTF-32. With char16_t it treats each char16_t as a single UCS-2 code point, so the surrogate pair needed for U+10000 is rejected and the conversion throws. You need to use std::codecvt_utf8_utf16 instead.
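For example, here is a minimal, untested rework of the question's two trials using std::codecvt_utf8_utf16 (note that the whole <codecvt> header is deprecated since C++17, but it still works):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    // codecvt_utf8_utf16 converts between UTF-8 bytes and UTF-16 code units,
    // so characters outside the BMP round-trip as surrogate pairs.
    wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> converter;

    string a = "\U00010000z";               // UTF-8: 4 bytes for U+10000, then 'z'
    u16string b = converter.from_bytes(a);  // no exception now
    cout << b.length() << endl;             // 3: surrogate pair + u'z'

    // Take both code units of the surrogate pair; substr(0, 1) would leave
    // a lone surrogate, which to_bytes() still rejects.
    u16string c = b.substr(0, 2);
    string q = converter.to_bytes(c);
    cout << q << endl;                      // prints U+10000 encoded as UTF-8

    return 0;
}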
