How to identify double-byte characters, such as Chinese characters, in C?

How do I identify double-byte characters, such as Chinese characters, in C++?
For example:
There is a text file TEST.TXT containing a single line, 3 bytes in total, with the following content:
A好

How can I identify bytes 2 and 3 as the Chinese character "好" ("good")?
In particular, how do I do this when all I have is a pointer to the bytes in memory?

Thank you.

CodePudding user response:

It depends on the encoding: ANSI (a multibyte character set), UTF-8, or Unicode.
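To illustrate why the encoding matters: the Chinese character 好 ("good") is 2 bytes in GBK (0xBA 0xC3) but 3 bytes in UTF-8 (0xE5 0xA5 0xBD), so a byte test written for one encoding does not carry over to the other. The following is a minimal sketch for the UTF-8 case only. The helper name `utf8_seq_len` is hypothetical, and it looks only at the bit pattern of the lead byte without validating the bytes that follow.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with lead byte b, judged from its bit pattern alone. Returns 0 when b is
   a continuation byte (10xxxxxx) or matches no lead pattern. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: most Chinese characters */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}
```

Note that the GBK bytes 0xBA 0xC3 do not form a valid UTF-8 sequence: 0xBA has the bit pattern of a continuation byte, so the helper returns 0 for it.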

CodePudding user response:

ASCII uses only seven bits, so the highest bit is never 1. If the highest bit is 1, treat the byte as part of a double-byte character.
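The highest-bit test can be sketched as a scan over a byte buffer. This is only an illustration: the function name `first_non_ascii` is hypothetical, and the buffer is taken as `unsigned char` because plain `char` may be signed, in which case a comparison such as `c > 127` would never be true.

```c
#include <stddef.h>

/* Hypothetical helper: index of the first byte whose highest bit is set,
   or -1 if every byte in buf[0..len) is 7-bit ASCII. */
long first_non_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            return (long)i;
    }
    return -1;
}
```

For the three bytes 0x41 0xBA 0xC3 from the question, the first byte with the highest bit set is at index 1.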

CodePudding user response:

The console only recognizes Chinese characters encoded in GBK or GB2312.
In that case a Chinese character is two bytes, and each byte has a value greater than 127.
Scanning from the start, two consecutive bytes with values greater than 127 form one Chinese character.
If (str[i] & 0x80 && str[i + 1] & 0x80) is true, then str[i] and str[i + 1] together form one Chinese character.
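A minimal sketch of that scan, assuming the bytes are available as an `unsigned char` buffer with a known length. The function name `count_chars` and the `doubles` out-parameter are hypothetical. One caveat: the rule that both bytes have the high bit set holds for GB2312; in GBK the second byte can also fall in 0x40-0x7E, so this sketch would split such characters.

```c
#include <stddef.h>

/* Count characters in str[0..len) with the rule from the reply above:
   a byte with the high bit set, followed by another byte with the high bit
   set, is one double-byte character; any other byte counts as one
   single-byte character. *doubles receives the number of double-byte
   characters found. */
size_t count_chars(const unsigned char *str, size_t len, size_t *doubles)
{
    size_t i = 0, total = 0;
    *doubles = 0;
    while (i < len) {
        if ((str[i] & 0x80) && i + 1 < len && (str[i + 1] & 0x80)) {
            (*doubles)++;
            i += 2;            /* skip both bytes of the pair */
        } else {
            i += 1;
        }
        total++;
    }
    return total;
}
```

For the bytes 0x41 0xBA 0xC3 this counts two characters, one of which is double-byte. The scan has to start at a character boundary; starting in the middle of a pair would mispair the bytes that follow.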

CodePudding user response:

Can a String store Chinese characters?

CodePudding user response:

Quoting the reply on the 2nd floor by truth is right or wrong:

ASCII uses only seven bits, so the highest bit is never 1. If the highest bit is 1, treat the byte as part of a double-byte character.

For a two-byte pair, must the highest bit of each byte be 1 for it to be a Chinese character or double-byte character?
Or is the highest bit of the high byte 1 and the highest bit of the low byte 0?
Thank you.

CodePudding user response:

Fun

Quoting the reply on the 3rd floor:

The console only recognizes Chinese characters encoded in GBK or GB2312.
In that case a Chinese character is two bytes, and each byte has a value greater than 127.
Scanning from the start, two consecutive bytes with values greater than 127 form one Chinese character.
If (str[i] & 0x80 && str[i + 1] & 0x80) is true, then str[i] and str[i + 1] together form one Chinese character.

Thank you, expert.
For a two-byte pair, must the highest bit of each byte be 1 for it to be a Chinese character or double-byte character?
Or is the highest bit of the high byte 1 and the highest bit of the low byte 0?

CodePudding user response:

Quoting the reply on the 5th floor by oracleperl:

Quote: the reply on the 2nd floor by truth is right or wrong:

ASCII uses only seven bits, so the highest bit is never 1. If the highest bit is 1, treat the byte as part of a double-byte character.

For a two-byte pair, must the highest bit of each byte be 1 for it to be a Chinese character or double-byte character?
Or is the highest bit of the high byte 1 and the highest bit of the low byte 0?
Thank you.

That is specific to the multibyte character set. In "A好", both bytes of the Chinese character have the highest bit set to 1:
pszMultiByte[0] 0x41 'A' char
pszMultiByte[1] 0xba '?' char
pszMultiByte[2] 0xc3 '?' char
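The debugger shows '?' for bytes 1 and 2 because neither byte is a printable character on its own; only the pair has a meaning. As a small sketch, the two bytes can be combined into one 16-bit code value, which for 0xBA and 0xC3 is 0xBAC3, the GB2312/GBK code of the Chinese character 好 ("good"). The function name `dbcs_code` is hypothetical.

```c
/* Combine a lead byte and a trail byte into one 16-bit code value,
   lead byte in the high 8 bits. */
unsigned int dbcs_code(unsigned char lead, unsigned char trail)
{
    return ((unsigned int)lead << 8) | trail;
}
```

The combined value is convenient for comparing against the code ranges published for GB2312 and GBK.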

CodePudding user response:

(1) Detecting a Unicode file: IsTextUnicode (lpBuffer, cb, lpi)

iFileLength = GetFileSize (hFile, NULL);
pBuffer = malloc (iFileLength + 2);  // two extra bytes to hold the terminating \0
// Read the file into the buffer and put \0 after the end of the file data
ReadFile (hFile, pBuffer, iFileLength, &dwBytesRead, NULL);
CloseHandle (hFile);
pBuffer[iFileLength] = '\0';
pBuffer[iFileLength + 1] = '\0';
// Test whether the text is Unicode: the first two bytes are 0xFEFF or 0xFFFE
// IS_TEXT_UNICODE_SIGNATURE - 0xFEFF (little-endian)
// IS_TEXT_UNICODE_REVERSE_SIGNATURE - 0xFFFE (big-endian)
iUniTest = IS_TEXT_UNICODE_SIGNATURE | IS_TEXT_UNICODE_REVERSE_SIGNATURE;

if (IsTextUnicode (pBuffer, iFileLength, &iUniTest))  // the third parameter is in/out: the tests that passed are returned in iUniTest
{
    pText = pBuffer + 2;  // skip the first two bytes and point at the text itself
    iFileLength -= 2;
    if (iUniTest & IS_TEXT_UNICODE_REVERSE_SIGNATURE)  // big-endian storage: swap the byte order
    {
        for (int i = 0; i < iFileLength / 2; i++)
        {
            bySwap = pText[2 * i];
            pText[2 * i] = pText[2 * i + 1];
            pText[2 * i + 1] = bySwap;
        }
    }
    // Allocate memory for the possibly converted string
    pConv = malloc (iFileLength + 2);
#ifndef UNICODE  // the edit control is not Unicode: convert the Unicode text to multibyte before display
    WideCharToMultiByte (CP_ACP, 0, (PWSTR) pText, -1, pConv, iFileLength + 2, NULL, NULL);  // iFileLength + 2 bytes in total, including the two \0
#else  // the edit control is Unicode: copy the text directly
    lstrcpy ((PTSTR) pConv, (PTSTR) pText);
#endif
}
else  // not a Unicode file
{
    pText = pBuffer;
    // Allocate memory for the possibly converted string
    pConv = malloc (2 * iFileLength + 2);  // ASCII to Unicode needs twice the space, plus two bytes for \0

#ifdef UNICODE  // the edit control is Unicode: convert the multibyte text to Unicode before display
    MultiByteToWideChar (CP_ACP, 0, pText, -1, (PTSTR) pConv, iFileLength + 1);  // converts up to the \0, iFileLength + 1 characters in total (including \0)
#else  // the edit control is not Unicode: copy the text directly
    lstrcpy ((PTSTR) pConv, (PTSTR) pText);
#endif
}
SetWindowText (hwndEdit, (PTSTR) pConv);

https://www.cnblogs.com/5iedu/p/4695106.html
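For readers without the Win32 API at hand, the signature check and the byte swap can be sketched in portable C. This is an assumption-laden illustration, not a replacement for IsTextUnicode: it looks only at the two-byte byte order mark, and the names `bom_kind`, `detect_utf16_bom` and `swap_utf16_bytes` are hypothetical.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first two bytes of a buffer for a UTF-16 byte order mark. */
enum bom_kind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return BOM_UTF16_LE;   /* U+FEFF stored low byte first */
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return BOM_UTF16_BE;   /* U+FEFF stored high byte first */
    }
    return BOM_NONE;
}

/* Swap each pair of bytes in place, turning big-endian UTF-16 into
   little-endian or back. A trailing odd byte is left untouched. */
void swap_utf16_bytes(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

A GBK file such as the 3-byte TEST.TXT from the question has no byte order mark, so `detect_utf16_bom` reports `BOM_NONE` for it and the bytes should be handled as multibyte text.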

CodePudding user response:

Don't try to "recognize" anything; just process the data as binary.

CodePudding user response:

https://baike.baidu.com/item/GBK/3910360 font? Fr=4 _1 aladdin#
https://baike.baidu.com/item/character encoding/8446880? Fr=aladdin# 4

See the GBK and GB2312 encoding ranges; they are stated explicitly there.

CodePudding user response:

The overall code range is 8140-FEFE: the lead byte is in 81-FE and the trail byte is in 40-FE, excluding the xx7F column.
There are 23,940 code points in total, of which 21,886 are assigned to Chinese characters and graphic symbols: 21,003 Chinese characters (including radicals and components) and 883 graphic symbols.
All code points are divided into three parts:
1. The Chinese character area, including:
a. The GB 2312 Chinese character area, i.e. GBK/2: B0A1-F7FE. It contains the 6,763 GB 2312 Chinese characters in their original order.
b. The GB 13000.1 extension Chinese character area, including:
(1) GBK/3: 8140-A0FE. It contains 6,080 CJK Chinese characters from GB 13000.1.
(2) GBK/4: AA40-FEA0. It contains 8,160 CJK Chinese characters and supplementary Chinese characters. The CJK Chinese characters come first, ordered by UCS code; the supplementary Chinese characters (including radicals and components) follow, ordered by page and character position in the Kangxi Dictionary.
2. The graphic symbol area, including:
a. The GB 2312 non-Chinese-character symbol area, i.e. GBK/1: A1A1-A9FE. Besides the GB 2312 symbols it contains 10 lowercase Roman numerals and the supplementary symbols of GB 12345, 717 symbols in total.
b. The GB 13000.1 extension non-Chinese-character area, i.e. GBK/5: A840-A9A0. The BIG-5 non-Chinese-character symbols, the structure characters and "〇" (ling) are placed in this area, 166 symbols in total.
3. The user-defined area, divided into three sub-areas (1), (2) and (3):
(1) AAA1-AFFE, 564 code points.
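The byte ranges listed above can be turned into simple range checks. The sketch below is one possible reading, assuming that a range such as B0A1-F7FE means "lead byte B0-F7 and trail byte A1-FE"; the function names are hypothetical. It checks only whether a byte pair lies inside the ranges, not whether a character is actually assigned to that code point.

```c
/* Lead byte of a GBK double-byte character: 0x81-0xFE. */
int gbk_is_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Trail byte of a GBK double-byte character: 0x40-0xFE, excluding 0x7F. */
int gbk_is_trail(unsigned char b)
{
    return b >= 0x40 && b <= 0xFE && b != 0x7F;
}

/* Nonzero if (lead, trail) lies in GBK/2 (B0A1-F7FE), the GB 2312
   Chinese character area, where both bytes are at least 0xA1. */
int gbk_in_gb2312_hanzi(unsigned char lead, unsigned char trail)
{
    return lead >= 0xB0 && lead <= 0xF7 && trail >= 0xA1 && trail <= 0xFE;
}
```

With these checks, the pair 0xBA 0xC3 from the question is a valid lead and trail byte and falls in GBK/2. A pair such as 0x81 0x40 is also valid GBK even though its trail byte is below 0x80, which is why testing the highest bit of both bytes is sufficient for GB2312 but not for all of GBK.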