I've noticed somewhat unexpected behavior when using std::printf() with max field length specifiers like %15ls in conjunction with wchar_t (for Cyrillic text).
Code example I use:
void printHeader() {
printDelim();
std::printf("\n|%15ls|%15ls|%15ls|%15ls|%15ls|", L"Имя", L"Континент", L"Длина", L"Глубина", L"Приток");
}
Simple function that prints delimiter (bunch of "-") and should be printing formatted line of titles (in Russian) separated by "|". So each field will be max 15 chars long and look pretty.
Actual output: | Имя|Континент| Длина|Глубина| Приток |
Notice:
- Locale is set like this: setlocale(LC_ALL, ""), and Russian is present there.
- If the parameters passed to printf() are in English, it works fine.
- Just in case, the output of setlocale():
Locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=ru_RU.UTF-8;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=ru_RU.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=ru_RU.UTF-8;LC_NAME=ru_RU.UTF-8;LC_ADDRESS=ru_RU.UTF-8;LC_TELEPHONE=ru_RU.UTF-8;LC_MEASUREMENT=ru_RU.UTF-8;LC_IDENTIFICATION=ru_RU.UTF-8
- Also tried it with std::wprintf(), but it does not print anything at all. std::printf() with %15s and the same strings without the L prefix prints in the same "broken length" manner, and the Cyrillic strings are correct.
I'm extremely curious why this happens with wchar_t.
P.S. - I'm aware that this code is almost literally C in C++, which is a bad idea and practice. Unfortunately it is required to do so in this case.
CodePudding user response:
Let's look at cppreference.com's description of the %ls format specifier, because it explains one part of what's happening here in a very clear way:
If the l specifier is used, the argument must be a pointer to the initial element of an array of wchar_t, which is converted to char array as if by a call to wcrtomb with zero-initialized conversion state.
The key takeaway is that %ls converts the wchar_t string to plain, narrow characters as the first order of business. Basically, std::printf works with narrow "characters", and wide character strings get converted to narrow (multibyte) strings before anything else happens.
Now that the input consists of narrow characters, we can make further progress. When "characters" are mentioned in the context of the width specifier, what is really being specified is a number of bytes. The 15 is a minimum field width of 15 bytes: shorter output is padded with spaces until it is 15 bytes long. Consider:
Имя
This is not three "characters" as far as printf is concerned. It is a sequence of six bytes: d0 98 d0 bc d1 8f.
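The byte count can be checked with a short sketch. The names `utf8CodePoints`, `kName`, and `checkName` are illustrative only and do not come from any library. The string is written as explicit byte escapes so that the result does not depend on the encoding of the source file, and the counting helper assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t utf8CodePoints(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "Имя" as the six bytes quoted above: d0 98 d0 bc d1 8f.
const std::string kName = "\xd0\x98\xd0\xbc\xd1\x8f";

// Checks the claim from the text: three letters, but six bytes.
void checkName() {
    assert(kName.size() == 6);            // what printf counts
    assert(utf8CodePoints(kName) == 3);   // what a reader counts
}
```

Calling `checkName()` passes silently, which is the point: the same string has length 6 for printf and length 3 for a human reader.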
Just in case - output of the setlocale():
Locale: LC_CTYPE=en_US.UTF-8 ...
Your system uses UTF-8 encoding, which uses more than one byte to encode non-ASCII characters. Each Cyrillic letter takes two bytes.
printf is a little bit dumb. It relies on your locale only to convert the wide string to bytes; when it pads a field, it knows nothing about your encoding. Every reference to character counts and field widths in printf's documentation really means bytes. %15s, or %15ls, really means not 15 characters, but 15 bytes, to format here. So it counts off 15 bytes and spits them out: for Имя that is 9 spaces of padding plus the 6 bytes of the string. But, when interpreted as UTF-8 characters, these bytes don't really take up 15 characters on the screen, only 12.
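The padding arithmetic can be observed directly by formatting into a buffer instead of the terminal. This is a minimal sketch: `formatWidth15` and `checkPadding` are hypothetical helper names, and the narrow `%15s` form is used because, per the question, it shows the same behavior and does not depend on the locale being set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats `s` with "%15s" into a std::string so the
// padded result can be inspected byte by byte.
std::string formatWidth15(const char* s) {
    char buf[128];
    const int n = std::snprintf(buf, sizeof buf, "%15s", s);
    if (n < 0) return std::string();  // formatting error
    // snprintf reports the untruncated length; clamp to what fits in buf.
    const std::size_t len =
        std::min<std::size_t>(static_cast<std::size_t>(n), sizeof buf - 1);
    return std::string(buf, len);
}

// Checks the arithmetic from the text for the six-byte string "Имя".
void checkPadding() {
    const std::string out = formatWidth15("\xd0\x98\xd0\xbc\xd1\x8f");
    assert(out.size() == 15);                             // 15 bytes in total
    assert(out.compare(0, 9, std::string(9, ' ')) == 0);  // 9 bytes of padding
}
```

The result is 15 bytes long, but only 9 + 3 = 12 of them are visible columns on a UTF-8 terminal, which matches the misaligned header in the question.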
Before Unicode, before the modern world with many alphabets and funny-looking characters, there was only the Latin alphabet, and characters and bytes were pretty much the same thing. printf's documentation harkens back to that era. This is not true any more, but printf is still living in the past.
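One possible way around this, offered only as a sketch and not as the standard fix, is to compute the padding yourself from the number of code points and pass the already padded string to `%s`. The names `countCodePoints`, `padLeft`, and `checkPadLeft` are made up for illustration. Counting code points is an approximation of display width: it is adequate for the Cyrillic titles in the question, but it would be wrong for combining marks or double-width East Asian characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes valid UTF-8 input.
std::size_t countCodePoints(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Right-aligns `s` in a field of `width` code points (not bytes),
// mirroring what "%15s" does for plain ASCII text.
std::string padLeft(const std::string& s, std::size_t width) {
    const std::size_t len = countCodePoints(s);
    if (len >= width) return s;  // like printf, never truncate
    return std::string(width - len, ' ') + s;
}

// "Имя" (3 letters, 6 bytes) padded to 15 columns: 12 spaces + 6 bytes.
void checkPadLeft() {
    const std::string out = padLeft("\xd0\x98\xd0\xbc\xd1\x8f", 15);
    assert(out.size() == 18);
    assert(countCodePoints(out) == 15);
}
```

A header cell could then be printed with something like `std::printf("|%s", padLeft(title, 15).c_str());`, where `title` is a UTF-8 encoded `std::string`.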