Localization
Many people think that the C language supports only ASCII. This is a misunderstanding: printf("这是中文") prints Chinese just fine, and fputs("C语言的本地化", fp) writes Chinese to a file. So which encoding does C use by default? That depends heavily on the operating system, the regional settings, and the compiler, because the C standard does not mandate any particular encoding. And although Chinese strings can be output, you cannot declare a single Chinese character with char c = '中', and strlen("C语言") returns 5 (under GBK, one byte for "C" plus two bytes for each Chinese character) rather than 3. So by default C does not support Chinese very well. How can we make C support Chinese properly? To answer that, we need to start from the basic concepts of character encoding.
When computers were first invented they supported only ASCII, that is, only English. As computers spread around the world, countries created their own encodings to display their own scripts. Chinese first used the GB2312 encoding, which covers 6763 Chinese characters; since daily work and study need only about 3000, that is enough for everyday use. GBK covers 21003 Chinese characters, far more than everyday demand, and easily handles both daily and commercial use, so Chinese editions of Windows have used GBK as the default character encoding since Windows 95. GB18030 covers 27533 Chinese characters and provides a unified foundation for fields such as the study of Chinese characters and ancient books; it is not normally needed in everyday use. These Chinese encodings are all compatible with ASCII and are variable-length: English takes one byte, a common Chinese character takes 2 bytes, and a rare character takes four bytes.
As global cultural exchange grew, people urgently needed one encoding that could cover the characters of the whole world, so that text would never again turn into garbage because of regional differences. Thus the Unicode character set, also called the unified code or universal code, was born, and newer operating system kernels support Unicode natively. As the name suggests, Unicode is a very large character set. To accommodate the scripts of every region while balancing space and performance, Unicode provides three encoding schemes:
UTF-8: a variable-length scheme that uses 1 to 4 bytes per character (the original design allowed up to 6, but the current standard stops at 4)
UTF-32: a fixed-length scheme that always uses 4 bytes per character
UTF-16: a compromise between variable and fixed length that uses 2 or 4 bytes per character
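Because UTF-8 is variable-length, it tailors the storage to each character much as GB2312 and GBK do, so it saves the most space for text that is mostly ASCII. However, the first byte has to be examined to determine how many bytes a character occupies, which makes the encoding the most complex, and locating a given character means counting from the start of the text, which makes it the slowest. UTF-32 is fixed-length, so no decoding is needed; it is the fastest but wastes the most space. UTF-16 is the odd one out: it places commonly used characters at code points 0 to FFFF, which are stored directly as one 2-byte unit, and puts the less common characters at code points 10000 to 10FFFF, which are stored as a pair of 2-byte units (a surrogate pair). Note that of the three, only UTF-8 is compatible with ASCII; UTF-16 and UTF-32 have no single-byte form.
UTF-8 uses the least bandwidth and is the most compatible. Its decoding performance later improved considerably, and newer hardware keeps getting faster, so performance is no longer a problem. As a result most plain text, source code, web pages, and configuration files now use UTF-8 in place of bare ASCII. As for UTF-16, 2 bytes are enough for common characters and 4 bytes are rarely needed, so it is often treated as if it were fixed-length, although strictly it is not.
The Windows kernel uses UTF-16, while Linux, macOS, and iOS generally use UTF-8; we will not argue about which is better. Although the Windows kernel uses UTF-16, for better localization the Control Panel offers a region option, and when it is set to Simplified Chinese the legacy encoding is GBK. On Windows 10 the console then defaults to GBK while text files default to UTF-8. Third-party software is harder to predict, since each program has its own default. Now let us look at which encoding the C language uses.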
Which encoding a C source file uses depends on the IDE; it is usually UTF-8 or ANSI. What is ANSI? Compared with Unicode it takes a different approach. ANSI is not strictly a single encoding: on Windows the name refers to whatever legacy code page the system is configured for, so pure English content is stored as plain ASCII, and on a Simplified Chinese system Chinese characters are stored as GBK. Some editors reportedly switch to Unicode automatically when a file mixes Chinese, Japanese, and Korean, but I have not tried that. Which encoding a C program ends up using depends, as mentioned earlier, on the operating system, the regional settings, and the compiler. There is one regularity, though: the program at run time usually uses whatever encoding the source file was saved in, so we can inspect the source file to see which encoding the IDE chose. In my tests with Dev-C++, a source file containing only English is saved as UTF-8 by default (editors that use pure ASCII are rare nowadays), and as soon as the source contains Chinese characters the encoding is switched to ANSI. Since the Windows console also defaults to ANSI, Chinese characters display correctly through the standard input and output functions.
If you only need to print a Chinese string to the console or write a passage of Chinese to a file, the standard input and output functions are enough. But if you want to operate on individual characters, they fall short: the char type cannot hold a 2-byte Chinese character. This is a historical legacy. Corresponding to variable-length and fixed-length encodings, the computer industry also uses the terms narrow character and wide character. Here we need the wide-character library: after including wchar.h you can use the wchar_t type, in which every character has the same fixed size (2 bytes on Windows, typically 4 bytes on Linux and macOS). In other words, the wide-character library works with a fixed-length encoding. The test code is as follows:
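#include <stdio.h>
#include <wchar.h>
int main()
{
    wchar_t wc1 = L'a';  /* English wide character */
    wchar_t wc2 = L'宽'; /* Chinese wide character */
    /* sizeof yields a size_t, so cast it before printing with %d */
    printf("%d,%d", (int)sizeof(wc1), (int)sizeof(wc2));
    return 0;
}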
The code above uses wchar_t to declare wide characters; a wide character literal needs the L prefix. The sizeof test shows that a Chinese and an English character occupy the same number of bytes (2 on Windows). To display wide characters you need the putwchar() and wprintf() functions, as follows:
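#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, ""); /* use the local region for wide output */
    wchar_t wc1 = L'a';
    wchar_t wc2 = L'中';
    putwchar(wc1);
    putwchar(wc2);
    /* %lc is the portable conversion for a wide character */
    wprintf(L"%lc%lc", wc1, wc2);
    return 0;
}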
Both functions require a call to setlocale() first, which tells them which locale to use when displaying wide characters. The prototype of setlocale() is:
char *setlocale(int category, const char *locale);
category is the kind of behavior the locale affects, such as time, currency, and character sorting; it is usually set to the constant LC_ALL, which affects all categories. locale names the region. Windows and Linux name regions differently: for Simplified Chinese, for example, Windows uses "chs" and Linux uses "zh_CN". Three arguments, however, mean the same everywhere: "C" is the neutral locale, which represents no region at all and is the default; "" is the local region taken from the system settings; and NULL specifies no region and simply returns the name of the current one. Wide-character input and output are separate from the ordinary standard input and output: setlocale() only determines how wide characters are converted for display and has nothing to do with the encoding that the ordinary byte-oriented functions use. Whether wide characters are stored in a variable-length or a fixed-length encoding is up to the compiler. If setlocale() is never called, the locale stays "C", and in my tests this locale can display only English and no Chinese characters at all. So call setlocale() to switch to the local region, and because setlocale() is declared in locale.h, include that header as well. With that in place, we can move on to wide strings.
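#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "");
    wchar_t wstr[] = L"宽字符串";
    /* %ls prints a wide string, %lc a single wide character */
    wprintf(L"%ls,%lc\n", wstr, wstr[1]);
    /* stay with wprintf: stdout is already in wide mode after the call above */
    wprintf(L"%d", (int)wcslen(wstr));
    return 0;
}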
After declaring a wide string, use wcslen() to get its length. Subscripts and pointer arithmetic land on the right character, again because wide characters have a fixed size. String copying, concatenation, and so on all have matching wide-string functions, which you can look up as needed. Unfortunately gets() and puts() have no direct wide-string counterparts in the standard library (fgetws() and fputws() are the closest). Later versions of the standard library also added wide character and wide string support to printf(). The test code is as follows:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "");
    wchar_t c = L'国';
    wchar_t wstr[] = L"宽字符串";
    /* %lc expects a wint_t argument, %ls a wchar_t string */
    printf("%lc,%ls", (wint_t)c, wstr);
    return 0;
}
printf() can thus mix numbers, ordinary characters, and wide characters in one call. But passing a wide string directly to puts() still produces gibberish.
CodePudding user response:
Good piece. Thanks for sharing.
CodePudding user response:
Encourage and support.