std::wcin.eof(), UTF-8 and locales on different systems

Time:10-27

I have very little understanding of C++ streams and their handling of Unicode, and I am trying to understand why code someone else wrote behaves this way. I'd be very grateful if someone could explain what is going on.


MCVE:

#include <string>
#include <iostream>

int main() {
  std::basic_string<wchar_t> line;
  std::locale::global(std::locale("")); // This
  std::wcout.imbue(std::locale(""));    // This
  std::wcin.imbue(std::locale(""));     // This
  for (;;) {
    std::getline(std::wcin, line);
    if (std::wcin.eof()) {
      std::wcout << L"EOF" << std::endl;
      break;
    }
    std::wcout << line << std::endl;
  }
}

Sample input test.txt:

( ) ライン
second line

EDIT: Hexdump of test.txt:

$ xxd test.txt
00000000: 2820 2920 e383 a9e3 82a4 e383 b30a 7365  ( ) ..........se
00000010: 636f 6e64 206c 696e 650a                 cond line.

Results

On a CentOS server, this is the result (1):

$ ./a.out < test.txt
( ) ライン
second line
EOF

On my Mac though (2):

$ ./a.out < test.txt
( )  EOF

If I comment out the three marked locale lines, the CentOS server outputs (3):

$ ./a.out < test.txt
EOF

while Mac outputs (4):

$ ./a.out < test.txt
( ) ライン
second line
EOF

Questions

  • Why does the second (2) result detect EOF mid-line? Where does the second space before EOF come from? (This result baffles me the most.)
  • Why does the third (3) result detect EOF immediately?
  • Most importantly: What can I do to consistently get the first (1) or last (4) result?

Environment

Here is the environment for both machines:

CentOS Linux release 7.5.1804 (Core):

$ c++ --version
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

macOS Big Sur (version 11.6):

$ c++ --version
Apple clang version 12.0.5 (clang-1205.0.22.11)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

Bonus

One additional puzzle. If I change the input to this (i.e. just add two more spaces inside the parentheses):

(   ) ライン
second line

the original (uncommented) code outputs this on Mac:

$ ./a.out < test.txt
(   )   ララライ翕翕ン
second line
EOF

Those are not artifacts of a messed-up terminal; all those extra characters are actually there:

$ ./a.out < test.txt | xxd
00000000: 2820 2020 2920 2020 e383 a9e3 83a9 e383  (   )   ........
00000010: a9e3 82a4 e7bf b7e7 bfb7 e383 b30a 7365  ..............se
00000020: 636f 6e64 206c 696e 650a 454f 460a       cond line.EOF.

Like... what?

EDIT In response to Giacomo Catenazzi's comment, I changed EOF printing from char to wide, which did fix one weirdness regarding input. My core issue is with reading wcin though, which proves to be unrelated.


EDIT Difference between std::getline and std::wcin.get

Here is the data obtained by getline. In this case, I don't get EOF, but the data is still weird:

std::wcout.imbue(std::locale("C")); // prevent commas
for (;;) {
  std::getline(std::wcin, line);
  if (std::wcin.eof()) {
    std::wcout << L"EOF" << std::endl;
    break;
  }
  int i, l = line.length();
  for (i = 0; i < l; i++) {
    wchar_t ch = line.at(i);
    std::wcout << std::hex << (int) ch << L" ";
  }
  std::wcout << std::endl;
}

Output:

28 20 29 20 20 0 30e9 30e9 30e9 30a4 30a4 30a4 30f3 
73 65 63 6f 6e 64 20 6c 69 6e 65 
EOF

Where does the 0 come from? What's with the repeated characters? The characters following the 0 translate to ララライイイン. (Note that here I do not try to output the received characters to wcout, only the numeric values, in order to eliminate any possible effects of output encoding.)

The data obtained by get is different, but no less strange:

// ...
std::wcout.imbue(std::locale("C")); // prevent commas
for (;;) {
  wchar_t ch = std::wcin.get();
  if (std::wcin.eof()) {
    std::wcout << L"EOF" << std::endl;
    break;
  }
  std::wcout << std::hex << (int) ch << L" ";
  if (std::char_traits<wchar_t>::eq(ch, std::wcin.widen('\n'))) {
    std::wcout << std::endl;
  }
}

Output:

28 20 29 20 7ffe 7ffe 30e9 7ffe 7ffe 30a4 7ffe 7ffe 30f3 a 
73 65 63 6f 6e 64 20 6c 69 6e 65 a 
EOF

This translates to 翾翾ラ翾翾イ翾翾ン. Where do those 7ffe characters come from?

CodePudding user response:

This is a libc++ bug.

Note the bug report says that it only affects std::wcin and not file streams, but in my experiments this is not the case. All wchar_t streams seem to be affected.

The other major open source implementation, libstdc++, doesn't have this bug. It is possible to sidestep the libc++ bug by building the entire application (including all dynamic libraries, if any) against libstdc++.

If this is not an option, then one way to cope with the bug is to use narrow char streams and then, when needed, recode the characters (presumably arriving encoded as UTF-8) to wchar_t (presumably UCS-4) separately. Another way is to get rid of wchar_t altogether and work in UTF-8 throughout the program, which is probably better in the long run.
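
The narrow-stream workaround could look roughly like the sketch below. It reads raw bytes with a plain char stream and decodes them by hand, so no wchar_t stream and no locale facet is involved. The names utf8_to_wide and read_wide_line are made up for illustration, and the sketch assumes the input really is UTF-8 and that wchar_t is 32 bits wide (as on Linux and macOS). It does not reject overlong encodings or surrogate code points, so treat it as a starting point rather than a validating decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <stdexcept>
#include <string>

// Assumption: one wchar_t can hold any Unicode code point (not true on Windows).
static_assert(sizeof(wchar_t) >= 4, "this sketch assumes a 32-bit wchar_t");

// Hypothetical helper: decode UTF-8 bytes into one wchar_t per code point.
// Throws on malformed lead/continuation bytes and on truncated sequences;
// overlong forms and surrogates are NOT checked.
std::wstring utf8_to_wide(const std::string& bytes) {
  std::wstring out;
  std::size_t i = 0;
  while (i < bytes.size()) {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    std::uint32_t cp = 0;   // code point being assembled
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (extra >= bytes.size() - i)
      throw std::runtime_error("truncated UTF-8 sequence");

    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
      if ((cont & 0xC0) != 0x80)
        throw std::runtime_error("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (cont & 0x3F);  // append the low 6 payload bits
    }
    out.push_back(static_cast<wchar_t>(cp));
    i += extra + 1;
  }
  return out;
}

// Hypothetical helper: read one line from a narrow stream and decode it.
// Returns false when no line could be read (end of input or stream error).
bool read_wide_line(std::istream& in, std::wstring& line) {
  std::string raw;
  if (!std::getline(in, raw)) return false;
  line = utf8_to_wide(raw);
  return true;
}
```

With this, the loop from the question would call read_wide_line(std::cin, line) as its loop condition instead of std::getline(std::wcin, line) followed by an eof() check. For the first line of test.txt, the decoded values are 28 20 29 20 30e9 30a4 30f3, with none of the stray 0 or 7ffe entries shown above.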
