I have very little understanding of C++ streams and their handling of Unicode, and I am trying to understand why code someone else wrote behaves this way. I'd be very grateful if someone could explain what is going on.
MCVE:
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::basic_string<wchar_t> line;
    std::locale::global(std::locale("")); // This
    std::wcout.imbue(std::locale(""));    // This
    std::wcin.imbue(std::locale(""));     // This
    for (;;) {
        std::getline(std::wcin, line);
        if (std::wcin.eof()) {
            std::wcout << L"EOF" << std::endl;
            break;
        }
        std::wcout << line << std::endl;
    }
}
Sample input test.txt:
( ) ライン
second line
EDIT: Hexdump of test.txt:
$ xxd test.txt
00000000: 2820 2920 e383 a9e3 82a4 e383 b30a 7365 ( ) ..........se
00000010: 636f 6e64 206c 696e 650a cond line.
Results
On a CentOS server, this is the result (1):
$ ./a.out < test.txt
( ) ライン
second line
EOF
On my Mac though (2):
$ ./a.out < test.txt
( )  EOF
If I comment out the three marked locale lines, CentOS outputs (3):
$ ./a.out < test.txt
EOF
while Mac outputs (4):
$ ./a.out < test.txt
( ) ライン
second line
EOF
Questions
- Why does the second (2) result detect EOF mid-line? Where does the second space before EOF come from? (This result baffles me the most.)
- Why does the third (3) result detect EOF immediately?
- Most importantly: What can I do to consistently get the first (1) or last (4) result?
Environment
Here is the environment for both machines:
CentOS Linux release 7.5.1804 (Core):
$ c++ --version
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
macOS Big Sur (version 11.6):
$ c++ --version
Apple clang version 12.0.5 (clang-1205.0.22.11)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
Bonus
One additional puzzle. If I change the input to this (i.e. just add two more spaces inside the parentheses):
(   ) ライン
second line
the original (uncommented) code outputs this on Mac:
$ ./a.out < test.txt
(   )   ララライ翕翕ン
second line
EOF
Those are not artifacts of a messed-up terminal; all those extra characters are actually there:
$ ./a.out < test.txt | xxd
00000000: 2820 2020 2920 2020 e383 a9e3 83a9 e383 ( ) ........
00000010: a9e3 82a4 e7bf b7e7 bfb7 e383 b30a 7365 ..............se
00000020: 636f 6e64 206c 696e 650a 454f 460a cond line.EOF.
Like... what?
EDIT: In response to Giacomo Catenazzi's comment, I changed EOF printing from char to wide, which did fix one weirdness regarding input. My core issue is with reading wcin though, which proves to be unrelated.
EDIT: Difference between std::getline and std::wcin.get
Here is the data obtained by getline. In this case, I don't get EOF, but the data is still weird:
std::wcout.imbue(std::locale("C")); // prevent commas
for (;;) {
    std::getline(std::wcin, line);
    if (std::wcin.eof()) {
        std::wcout << L"EOF" << std::endl;
        break;
    }
    int i, l = line.length();
    for (i = 0; i < l; i++) {
        wchar_t ch = line.at(i);
        std::wcout << std::hex << (int) ch << L" ";
    }
    std::wcout << std::endl;
}
Output:
28 20 29 20 20 0 30e9 30e9 30e9 30a4 30a4 30a4 30f3
73 65 63 6f 6e 64 20 6c 69 6e 65
EOF
Where does the 0 come from? What's with the repeated characters? The characters following the 0 translate to ララライイイン. (Note that here I do not try to output the received characters to wcout, only the numeric values, in order to eliminate any possible effects of output encoding.)
The data obtained by get is different, but no less strange:
// ...
std::wcout.imbue(std::locale("C")); // prevent commas
for (;;) {
    wchar_t ch = std::wcin.get();
    if (std::wcin.eof()) {
        std::wcout << L"EOF" << std::endl;
        break;
    }
    std::wcout << std::hex << (int) ch << L" ";
    if (std::char_traits<wchar_t>::eq(ch, std::wcin.widen('\n'))) {
        std::wcout << std::endl;
    }
}
Output:
28 20 29 20 7ffe 7ffe 30e9 7ffe 7ffe 30a4 7ffe 7ffe 30f3 a
73 65 63 6f 6e 64 20 6c 69 6e 65 a
EOF
This translates to 翾翾ラ翾翾イ翾翾ン. Where do those 7ffe characters come from?
Answer
This is a libc++ bug.
Note the bug report says that it only affects std::wcin and not file streams, but in my experiments this is not the case. All wchar_t streams seem to be affected.
The other major open source implementation, libstdc++, doesn't have this bug. It is possible to sidestep the libc++ bug by building the entire application (including all dynamic libraries, if any) against libstdc++.
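To check which of the two standard libraries a given build actually uses, a small sketch like the following may help. It relies on the version macros that each library defines once any standard header is included (_LIBCPP_VERSION for libc++, __GLIBCXX__ for libstdc++). The function name standard_library_name is made up for illustration.

```cpp
#include <string>  // any standard header makes the library's version macros visible

// Hypothetical helper: report which C++ standard library this
// translation unit was compiled against.
std::string standard_library_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++";      // LLVM's implementation (default with Apple clang)
#elif defined(__GLIBCXX__)
    return "libstdc++";   // GNU implementation (default with GCC)
#else
    return "unknown";
#endif
}
```

A build that has been switched over to libstdc++ would report "libstdc++" here.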
If this is not an option, then one way to cope with the bug is to use narrow char streams, and then, when needed, recode the characters (presumably arriving encoded as UTF-8) to wchar_t (presumably UCS-4) separately. Another way is to get rid of wchar_t altogether and work in UTF-8 throughout the program, which is probably better in the long run.