I found a number of approaches online. Basically they all judge the encoding from the first two bytes of the file, but the files I have to process carry no format marker in their first two bytes.
For example, the code in this post, http://bbs.csdn.net/topics/310101466, fails to identify my file as UTF-8.
When I open the text file in EditPlus, the status bar at the bottom shows that it is UTF-8 encoded, but the code from the post above judges it to be ANSI, see below:
Looking at the file in a hex viewer, the first two bytes are the file content itself; there is no marker that identifies the encoding, see below:
http://img.my.csdn.net/uploads/201412/04/1417659625_4106.jpg
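The "judge by the first bytes" approach mentioned above amounts to looking for a byte order mark (BOM). A minimal sketch in Python, assuming the helper name `sniff_bom` (hypothetical, not from the linked post), shows why it returns nothing for a file saved as UTF-8 without a BOM:

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding name if data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    # No marker: the first bytes are ordinary content, so the
    # encoding cannot be decided from them alone.
    return None
```

A file whose first bytes are plain content gets `None` here, which is the situation shown in the hex dump; the content itself then has to be inspected.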
I don't know how EditPlus detects it; Notepad's Save As dialog also selects UTF-8 automatically. Do they read through the file and analyze its content byte by byte? I have many files to process; some are ANSI and some are UTF-8. The goal is to pick out the ANSI ones and modify part of their content: specifically, change charset=utf-8 in the web page's meta tag to charset=euc-kr, otherwise the page is garbled when opened.
Test file download: http://files.cnblogs.com/sysdzw/%E6%B5%8B%E8%AF%95%E6%96%87%E4%BB%B6.rar
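The task described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: any file that fails strict UTF-8 decoding is treated as ANSI, the names `is_utf8`, `fix_charset` and `fix_folder` are hypothetical, and the `*.htm*` file pattern is a guess at how the pages are named.

```python
import re
from pathlib import Path

# Matches charset=utf-8 with an optional opening quote, case-insensitively.
CHARSET_RE = re.compile(rb"(charset\s*=\s*[\"']?)utf-8", re.IGNORECASE)


def is_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def fix_charset(data: bytes) -> bytes:
    """Rewrite charset=utf-8 to charset=euc-kr, but only in non-UTF-8 data."""
    if is_utf8(data):
        return data  # assumption: valid UTF-8 files already declare the right charset
    return CHARSET_RE.sub(rb"\g<1>euc-kr", data)


def fix_folder(folder: str) -> None:
    """Apply fix_charset to every HTML file directly inside folder."""
    for path in Path(folder).glob("*.htm*"):
        original = path.read_bytes()
        fixed = fix_charset(original)
        if fixed != original:
            path.write_bytes(fixed)
```

Working on raw bytes avoids re-encoding the file, so only the meta tag changes. Running it on a copy of the folder first is advisable, since the ANSI assumption may not hold for every file.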
CodePudding user response:
Given charset=euc-kr, the file should not be UTF-8.
CodePudding user response:
The original poster probably means this: the charset is declared as encoding A, while the file itself is saved in encoding B.
I'm just passing this link along:
http://bbs.csdn.net/topics/370095245
CodePudding user response:
Quoting bcrun (1st floor): Given charset=euc-kr, the file should not be UTF-8.
ANSI file + charset=utf-8: the HTML file opens garbled.
ANSI file + charset=euc-kr: the HTML file opens normally.
The problem now is that a folder holds a large number of HTML files, some ANSI and some UTF-8, and I can't tell which ones need the adjustment above. The approach in the thread linked upstairs, judging by file content, doesn't seem to work either; the original poster of that thread never got a complete solution. Surely I don't have to work out how to drive EditPlus just to read the encoding from its status bar?
CodePudding user response:
I opened your test files in UltraEdit and its detection failed as well: every file was identified as UTF-8... There is no way around scanning the file; the encoding can only be identified from the bytes inside, which is a bit painful.
Reference:
UTF-8 is a variable-length encoding. If a character takes only one byte, the highest bit of that byte is 0. If it takes several bytes, the number of consecutive 1 bits starting from the highest bit of the first byte gives the number of bytes in the encoding, and every remaining byte begins with 10. UTF-8 uses at most 6 bytes, as in this table:
1 byte:  0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Depending on the range of characters they contain, your files fall into just two cases:
First, the entire file consists of letters, digits and symbols (pure ASCII). There is nothing to agonize over here: ANSI and UTF-8 are byte-for-byte identical, so read it directly and treat it as UTF-8.
Second, the file contains bytes outside the ASCII range. Then read the whole file byte by byte. When you find a byte that begins with 11xxxxxx (0xC0 or greater), count its leading 1 bits to get the length of the character, then read that many bytes and check whether all the following ones begin with 10 (0x80 to 0xBF). If they do, the file is UTF-8; if not, simply and crudely return ANSI.
CodePudding user response:
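The byte-pattern rules from the reply above can be sketched in Python as follows. This is an illustration rather than a definitive detector: the name `looks_like_utf8` and the `max_len` parameter are assumptions. `max_len` defaults to 4, the limit in the current UTF-8 standard (RFC 3629), and can be set to 6 to match the older table. Overlong forms and surrogate code points are not rejected.

```python
def looks_like_utf8(data: bytes, max_len: int = 4) -> bool:
    """Check that every multibyte sequence has a valid lead/continuation pattern."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:            # 0xxxxxxx: single-byte (ASCII) character
            i += 1
            continue
        # Count the leading 1 bits of the lead byte to get the sequence length.
        length, mask = 0, 0x80
        while b & mask:
            length += 1
            mask >>= 1
        # length == 1 is a stray continuation byte (10xxxxxx);
        # anything above max_len is not a valid lead byte.
        if length < 2 or length > max_len:
            return False
        if i + length > n:      # sequence is cut off at the end of the data
            return False
        # Every following byte of the sequence must look like 10xxxxxx.
        for j in range(i + 1, i + length):
            if data[j] & 0xC0 != 0x80:
                return False
        i += length
    return True                 # pure ASCII also ends up here
```

Scanning the whole buffer, rather than stopping at the first valid sequence, reduces the chance that an ANSI file passes because a few of its bytes happen to resemble a UTF-8 sequence.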
That was a long write-up, but the rules are very clear, the code is not complicated to write, and it will not take long to run. To improve detection accuracy, it is still best to check the entire file.
CodePudding user response:
I feel your thinking may have gone down the wrong track. Try this editor, a domestic product that is stronger than EditPlus. If it can identify your files automatically, join its user group and ask the author directly what order of checks it uses. As far as I understand the software, this is not a technical secret, and he should be willing to share: http://www.everedit.net/
CodePudding user response:
Quoting Runnerchin (5th floor): That was a long write-up, but the rules are very clear, the code is not complicated to write, and it will not take long to run. To improve detection accuracy, it is still best to check the entire file.
What you describe is also the approach most of us would think of first. Of course, a real product has to consider fault tolerance in every respect, but I agree that the full scan you mention is a necessary step.
CodePudding user response:
Here is a function that detects a web page's encoding; see if it helps the original poster.

Private Function IsUTF8(Bytes) As Boolean ' web page encoding detection
    On Error GoTo Err
    Dim I As Long, AscN As Long, Length As Long
    Length = UBound(Bytes) + 1
    If Length < 3 Then
        IsUTF8 = False
        Exit Function
    ElseIf Bytes(0) = &HEF And Bytes(1) = &HBB And Bytes(2) = &HBF Then
        ' The file starts with the UTF-8 BOM (EF BB BF)
        IsUTF8 = True
        Exit Function
    End If
    Do While I <= Length - 1
        If Bytes(I) < 128 Then
            ' ASCII byte; count it so a pure-ASCII file can be recognized below
            I = I + 1
            AscN = AscN + 1
        ElseIf (Bytes(I) And &HE0) = &HC0 And (Bytes(I + 1) And &HC0) = &H80 Then
            ' 2-byte sequence: 110xxxxx 10xxxxxx
            ' (And does not short-circuit, so a sequence cut off at the end of
            ' the array raises an error that the handler below catches)
            I = I + 2
        ElseIf I + 2 < Length Then
            If (Bytes(I) And &HF0) = &HE0 And (Bytes(I + 1) And &HC0) = &H80 And (Bytes(I + 2) And &HC0) = &H80 Then
                ' 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
                I = I + 3
            Else
                ' Anything else, including 4-byte sequences, is treated as not UTF-8
                IsUTF8 = False
                Exit Function
            End If
        Else
            IsUTF8 = False
            Exit Function
        End If
    Loop
    If AscN = Length Then
        ' Pure ASCII: reported as not UTF-8
        IsUTF8 = False
    Else
        IsUTF8 = True
    End If
    Exit Function
Err:
    ' EM is the poster's own error-reporting routine (not shown in the post);
    ' IsUTF8 keeps its default value of False on this path
    EM "IsUTF8"
End Function
CodePudding user response:
Quoting qq_15724883 (8th floor): Here is a function that detects a web page's encoding; see if it helps the original poster. Private Function IsUTF8(Bytes) As Boolean ...