Why trying to print unicode encoded strings with cout leads to compilation error in newer C standa

Time:02-01

I tried printing Unicode characters as follows with Visual C++ 2022 version 17.4.4, with the C++ standard set to the latest:

#include <iostream>

using namespace std;

int main()
{
  cout << u8"The official vowels in Danish are: a, e, i, o, u, \u00E6, \u00F8, \u00E5 and y.\n";
  return 0;
}

I get the following compilation errors:

1>C:\projects\cpp\test\test.cpp(7,8): error C2280: 'std::basic_ostream<char,std::char_traits<char>> &std::operator <<<std::char_traits<char>>(std::basic_ostream<char,std::char_traits<char>> &,const char8_t *)': attempting to reference a deleted function
1>C:\projects\cpp\test\test.cpp(7,8): error C2088: '<<': illegal for class

The same behavior is observed with u (UTF-16) and U (UTF-32) string literals.

Setting the standard to C++17 or C++14 makes the program compile.

What is the rationale for disallowing this code in C++20 and later standards, and what is the correct way to print Unicode string literals in those standards?

CodePudding user response:

Until C++20, u8"..." had type const char[N]. Since C++20, it has type const char8_t[N].

std::cout is a std::basic_ostream<char>, and since C++20 its operator<< overloads for char8_t data are deleted, so it can't output such data.
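The type change can be checked at compile time. The following is only a sketch: the alias name U8Elem is made up for illustration, and the check relies on the feature-test macro __cpp_char8_t being defined whenever char8_t is enabled.

```cpp
#include <type_traits>

// Element type of a u8 string literal, with the reference and const stripped.
// (U8Elem is a hypothetical name used only in this sketch.)
using U8Elem = std::remove_cv<
    std::remove_reference<decltype(u8"x"[0])>::type>::type;

#ifdef __cpp_char8_t
// C++20 and later, or any mode in which the compiler enables char8_t.
static_assert(std::is_same<U8Elem, char8_t>::value, "u8 literals hold char8_t");
#else
// C++11 through C++17.
static_assert(std::is_same<U8Elem, char>::value, "u8 literals hold char");
#endif

// The stored code units are the same either way: U+00E6 is 0xC3 0xA6 in
// UTF-8, so the array holds two code units plus the terminating null.
static_assert(sizeof(u8"\u00E6") == 3, "two UTF-8 code units plus terminator");
```

Only the element type differs between the standards; the bytes in the literal do not, which is why a workaround that reinterprets the data as char can print the same text.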

A possible workaround:

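// User-provided overload. As a non-template function it is preferred over
// the deleted standard template, so cout << u8"..." selects it.
std::ostream& operator<<(std::ostream& os, const char8_t* s) {
  // Pass the UTF-8 code units through unchanged; no transcoding happens here.
  os << reinterpret_cast<const char*>(s);
  return os;
}

// Output (when the terminal interprets the bytes as UTF-8): The official vowels in Danish are: a, e, i, o, u, æ, ø, å and y.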

CodePudding user response:

What is the rationale for disallowing this code in C++20

Firstly, before C++20 there was no char8_t type. The u8 prefix simply produced char data, affecting only its encoding.

C++20 introduced char8_t in P0482 and, in a backward-incompatible change, made the u8 prefix produce char8_t data.

But, as P1423 points out, that introduced a silent, counterproductive behavioral change, and the proposed solution was to make the operation ill-formed instead:

An unintended and silent behavioral change was introduced with the adoption of P0482R6. In C++17, the following code wrote the code units of the literals to stdout. In C++20, this code now writes the character literal as a number, and the address of the string literal, to stdout.

std::cout << u8'x';    // In C++20, writes the number 120.
std::cout << u8"text"; // In C++20, writes a memory address.

This is a surprising change that provides no benefit to programmers. Adding deleted ostream inserters would avoid this surprising behavioral change while reserving the possibility to specify behavior for these operations in the future (for example, to specify implicit transcoding to the execution encoding).
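To illustrate the mechanism the paper describes, here is a minimal sketch built on a toy stream rather than std::ostream. The names Sink and is_insertable are hypothetical, and char16_t stands in for char8_t so that the example also compiles before C++20.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical miniature stream standing in for std::ostream.
struct Sink {
    std::string log;
};

// Like std::ostream, Sink accepts narrow text and, as a fallback,
// any object pointer (recorded here as a placeholder, not a real address).
inline Sink& operator<<(Sink& s, const char* text) { s.log += text; return s; }
inline Sink& operator<<(Sink& s, const void*) { s.log += "<address>"; return s; }

// The fix in the style of P1423: an exact-match overload that is deleted.
// Without it, a const char16_t* would silently convert to const void*.
Sink& operator<<(Sink&, const char16_t*) = delete;

// Detection idiom: true when `sink << T` is a well-formed expression.
template <class T, class = void>
struct is_insertable : std::false_type {};

template <class T>
struct is_insertable<T, decltype(void(std::declval<Sink&>() << std::declval<T>()))>
    : std::true_type {};

static_assert(is_insertable<const char*>::value, "narrow text is accepted");
static_assert(is_insertable<const int*>::value, "falls back to const void*");
static_assert(!is_insertable<const char16_t*>::value, "deleted overload wins");
```

Because the deleted overload is an exact match, overload resolution selects it ahead of the const void* conversion, and the insertion becomes a compile error instead of printing an address.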


what is the correct way to print Unicode string literals in those standards?

As of C++20, there is no standard way to print char8_t, char16_t or char32_t data as text directly. You have to convert the Unicode data into the native encoding used by char or wchar_t, and then print that. There is no standard, non-deprecated way to do such a conversion, though.
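One option is to write the conversion by hand. The sketch below encodes UTF-32 data as UTF-8 in a std::string. It assumes that the narrow encoding expected by the output device is UTF-8, which is not guaranteed on every platform, and the helper names append_utf8 and to_utf8 are made up for this example.

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Values that are not Unicode scalar values (surrogates, or anything
// above U+10FFFF) are replaced by U+FFFD, the replacement character.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
    if (cp < 0x80) {            // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a whole UTF-32 string to UTF-8 bytes held in a std::string.
inline std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

With this in place, std::cout << to_utf8(U"\u00E6, \u00F8, \u00E5") compiles under C++20, since the inserted value is an ordinary std::string. Whether the characters display correctly still depends on the terminal interpreting the bytes as UTF-8; on a Windows console that typically means switching to code page 65001 first.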
