C++ std::printf formatting breaks with Cyrillic alphabet

I've noticed some unexpected behavior when using std::printf() with max field length specifiers like %-15.15ls in conjunction with wchar_t (for Cyrillic text).

Code example I use:

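void printHeader() {
    printDelim();  // defined elsewhere: prints a row of "-"
    // Each title is meant to be left-aligned and padded or cut to 15 characters.
    std::printf("\n|%-15.15ls|%-15.15ls|%-15.15ls|%-15.15ls|%-15.15ls|", L"Имя", L"Континент", L"Длина", L"Глубина", L"Приток");
}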

It is a simple function that prints a delimiter (a bunch of "-") and should then print a formatted line of titles (in Russian) separated by "|", so that each field is at most 15 characters long and looks neat.

Notice:

The locale is set like this: setlocale(LC_ALL, ""), and Russian is present there.
If the parameters passed to printf() are in English, it works fine.
Just in case, here is the output of setlocale():

Locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=ru_RU.UTF-8;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=ru_RU.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=ru_RU.UTF-8;LC_NAME=ru_RU.UTF-8;LC_ADDRESS=ru_RU.UTF-8;LC_TELEPHONE=ru_RU.UTF-8;LC_MEASUREMENT=ru_RU.UTF-8;LC_IDENTIFICATION=ru_RU.UTF-8

I also tried it with std::wprintf(), but it does not print anything at all.
std::printf() with %-15.15s and the same strings without the L prefix prints with the same "broken length", although the Cyrillic strings themselves come out correct.

I'm extremely curious why this happens with wchar_t.

P.S. - I'm aware that this code is almost literally C in C++, which is a bad idea and bad practice. Unfortunately it is required in this case.

CodePudding user response：

Let's look at cppreference.com's description of the %ls format specifier, because it explains one part of what's happening here in a very clear way:

If the l specifier is used, the argument must be a pointer to the initial element of an array of wchar_t, which is converted to char array as if by a call to wcrtomb with zero-initialized conversion state.

The key take-away is that %ls converts the wchar_t string to plain, narrow characters, as the first order of business. Basically, std::printf works with non-wide "characters", and allegedly-wide character strings get converted to non-wide "character" strings, before anything else happens.

Now that the input's domain consists of non-wide characters we make further progress:

When the documentation refers to "characters" in the context of the width specifier, it is really specifying a number of bytes. 15 bytes. That's what it really means:

Имя

This is not three "characters", as far as printf is concerned. This is a six-byte sequence, here they are: d0 98 d0 bc d1 8f.
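To check the byte sequence yourself, here is a minimal sketch. The helper name hexBytes is made up for this illustration, and the string is spelled out as explicit UTF-8 escapes so the result does not depend on how the source file is encoded.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of any library): renders each byte of a
// narrow string as two lowercase hex digits, separated by spaces.
std::string hexBytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// "Имя" written as its UTF-8 bytes (assumption: the terminal and locale use UTF-8).
const std::string kName = "\xd0\x98\xd0\xbc\xd1\x8f";
```

Here kName.size() is 6, not 3, and hexBytes(kName) gives "d0 98 d0 bc d1 8f".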

Just in case - output of the setlocale():

Locale: LC_CTYPE=en_US.UTF-8 ...

Your system uses UTF-8 encoding, which uses more than one byte to encode every non-ASCII character. Each Cyrillic letter takes two bytes.

printf is a little bit dumb. It has no notion of how many characters your encoding will end up putting on the screen. Every reference to character counts and field widths in printf's documentation really means bytes. %-15.15s, or %-15.15ls, really means not 15 characters but 15 bytes to format here. So it counts off 15 bytes and spits them out. But when interpreted as UTF-8 characters, these bytes don't really take up 15 characters on the screen.
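The effect, and one possible way around it, can be sketched as follows. This is only an illustration under stated assumptions: the text is valid UTF-8, every code point occupies one screen column (true for basic Cyrillic, not for combining marks or wide CJK characters), and the format string "%-15.15s" stands in for the one in the question. The names codePoints, formatByBytes and padToColumns are invented for this example.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes, which always look like 10xxxxxx.
std::size_t codePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// What printf does: width and precision are applied to bytes.
std::string formatByBytes(const std::string& utf8) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%-15.15s", utf8.c_str());
    return buf;
}

// Workaround sketch: pad by hand so the result has `columns` code points.
// It does not truncate longer strings; that would need a cut on a
// code point boundary.
std::string padToColumns(const std::string& utf8, std::size_t columns) {
    std::string out = utf8;
    const std::size_t n = codePoints(utf8);
    if (n < columns) out.append(columns - n, ' ');
    return out;
}
```

For the six-byte "Имя", formatByBytes returns 15 bytes that show up as only 12 columns (3 letters plus 9 spaces), while padToColumns(name, 15) returns 18 bytes that show up as 15 columns.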

Before Unicode, before the modern world with many alphabets, funny-looking characters, there was only the Latin alphabet, and characters and bytes were pretty much the same thing, and printf's documentation harkens back to that era. This is not true any more, but printf is still living in the past.