Is there a proper way to receive input from console in UTF-8 encoding?

Time:03-17

When reading input from std::cin on Windows, the input apparently always arrives in Windows-1252 (the default for the host machine in my case), regardless of the configuration applied, which apparently only affects the output. Is there a proper way to capture input on Windows in UTF-8 encoding?

For instance, let's check out this program:

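#include <iostream>
#include <locale>
#include <string>

int main(int argc, char* argv[])
{
    // Ask both streams to use a Spanish UTF-8 locale.
    std::cin.imbue(std::locale("es_ES.UTF-8"));
    std::cout.imbue(std::locale("es_ES.UTF-8"));

    std::cout << "ñeñeñe> ";
    std::string in;
    std::getline(std::cin, in);  // read one line of user input
    std::cout << in;             // write the captured line back

}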

I compiled it with Visual Studio 2022 on a Windows machine with a Spanish locale. The source code is in UTF-8. When running the resulting program (in a Windows PowerShell session, after executing chcp 65001 to set the code page to UTF-8), I see the following:

PS C:\> .\test_program.exe
ñeñeñe> ñeñeñe
 e e e

The first "ñeñeñe" is correct: the "ñ" character is displayed properly on the console. So far, so good. The user input is also echoed to the console correctly as it is typed: another good point. But when the program writes the captured string back to the output, each "ñ" character is replaced by a blank space.

When debugging this program, I see that the variable "in" has captured the input in an encoding that is not UTF-8: the "ñ" takes only one char, whereas in UTF-8 that character must take two. My conclusion is that the input is not affected by the chcp command. Am I doing something wrong?
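
One way to check the encoding without a debugger is to dump the bytes of the captured string. The following is only a minimal sketch: hex_bytes is a hypothetical helper name, not part of any library. In UTF-8, "ñ" is the two bytes c3 b1; in Windows-1252 it is the single byte f1.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats every byte of `s` as two lowercase hex digits,
// separated by spaces, so the encoding of a captured string can be inspected.
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}
```

Writing hex_bytes(in) to std::cout right after the std::getline call would show whether each "ñ" arrived as two bytes or as one.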

UPDATE

Somebody asked me to check what happens when switching to wcout/wcin:

std::wcout << u"ñeñeñe> ";
std::wstring in;
std::getline(std::wcin, in);
std::wcout << in;

Behaviour:

PS C:\> .\test.exe
0,000,7FF,6D1,B76,E30ñeñeñe
 e e e

Another attempt (writing the string as L"ñeñeñe"):

ñeñeñe> ñeñeñe
 e e e

Leaving the literal as a plain narrow string:

std::wcout << "ñeñeñe> ";

Result is:

eee>

Answer:

This is the closest to the solution I've found so far:

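#include <fcntl.h>   // _O_WTEXT
#include <io.h>      // _setmode
#include <cstdio>    // _fileno, stdin, stdout
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    // Switch both console streams to wide (UTF-16) text mode.
    _setmode(_fileno(stdout), _O_WTEXT);
    _setmode(_fileno(stdin), _O_WTEXT);

    std::wcout << L"ñeñeñe";
    std::wstring in;
    std::getline(std::wcin, in);
    std::wcout << in;

    return 0;
}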

The solution described here goes in the right direction. One caveat: stdin and stdout must both be switched to the same mode, because the console echo rewrites the input. The remaining problem is having to write string literals with \uXXXX escape codes. I am still working out how to avoid that, or whether to use #defines to make the text literals clearer.
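
Since the original goal was UTF-8, the std::wstring read this way still has to be converted. Below is a rough, portable sketch of that step, not a definitive implementation: to_utf8 is a hypothetical helper name, and on Windows the WideCharToMultiByte API with CP_UTF8 should do the same job. It assumes the wide string holds UTF-16 code units (as wchar_t does on Windows) or whole code points, and it replaces unpaired surrogates with U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8.
// Works when CharT holds UTF-16 code units (wchar_t on Windows, char16_t)
// and also when it holds whole code points (32-bit wchar_t, char32_t).
template <class CharT>
std::string to_utf8(const std::basic_string<CharT>& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = static_cast<char32_t>(in[i]);

        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            char32_t lo = static_cast<char32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;

        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

In the program above, std::string utf8 = to_utf8(in); placed after the std::getline call would give a byte string in which each "ñ" takes two bytes.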
