VB: how to determine the encoding of a text file?

Time:10-02

I found several methods online. Basically they all judge the encoding from the first two bytes of the file, but the files I have to handle carry no format marker in their first two bytes.
For example, the code in this post, http://bbs.csdn.net/topics/310101466, does the check but fails to recognize my files as UTF-8.
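For context, the "first bytes" approach mentioned above amounts to looking for a byte order mark (BOM). The following is only a minimal sketch of that idea, not the code from the linked post; the function name BomName and its return strings are made up for illustration, and it assumes the file has already been read into a non-empty Byte array. A file saved as UTF-8 without a BOM, like the one described here, makes it return an empty string, which is why this check alone is not enough.

```vb
' Hypothetical helper: reports the encoding implied by a BOM, or "" if none.
' Assumes Bytes is an allocated (non-empty) Byte array.
Private Function BomName(Bytes() As Byte) As String
    Dim o As Long, n As Long

    o = LBound(Bytes)
    n = UBound(Bytes) - o + 1
    BomName = ""

    ' EF BB BF marks UTF-8 with a BOM
    If n >= 3 Then
        If Bytes(o) = &HEF And Bytes(o + 1) = &HBB And Bytes(o + 2) = &HBF Then
            BomName = "UTF-8"
            Exit Function
        End If
    End If

    ' FF FE marks UTF-16 little-endian, FE FF marks UTF-16 big-endian
    If n >= 2 Then
        If Bytes(o) = &HFF And Bytes(o + 1) = &HFE Then
            BomName = "UTF-16LE"
        ElseIf Bytes(o) = &HFE And Bytes(o + 1) = &HFF Then
            BomName = "UTF-16BE"
        End If
    End If
End Function
```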

When I open the text file in EditPlus, the status bar at the bottom shows that it is UTF-8 encoded, but the code from the post above judges it to be ANSI. See below:



Viewed in hexadecimal, the first two bytes are the file content itself; there is no specific encoding marker. See below:
http://img.my.csdn.net/uploads/201412/04/1417659625_4106.jpg

I don't know how EditPlus detects it. When I use "Save As" in EditPlus or Notepad, UTF-8 is selected automatically. Do they read through the content and analyze it byte by byte? There are many files to process; some are ANSI and some are UTF-8. The goal is to pick out the ANSI ones and modify part of their content: specifically, change charset=utf-8 in the web page's meta tag to charset=euc-kr, otherwise the page opens garbled.

Test file download: http://files.cnblogs.com/sysdzw/%E6%B5%8B%E8%AF%95%E6%96%87%E4%BB%B6.rar

CodePudding user response:

Given charset=euc-kr, the file should not be UTF-8.

CodePudding user response:

What the original poster means may be: the charset is declared as encoding A, but the file was saved using encoding B.

I'm only passing along a link:
http://bbs.csdn.net/topics/370095245

CodePudding user response:

Quoting bcrun's reply (1st floor):
Given charset=euc-kr, the file should not be UTF-8.


ANSI file + charset=utf-8: the HTML file opens garbled
ANSI file + charset=euc-kr: the HTML file opens normally

There is now a large number of HTML files in one folder, some ANSI and some UTF-8, and I cannot tell which are the target files for the adjustment above.

The approach in the link posted above, judging by file content, does not seem to work either; the original poster of that thread never got a complete solution. Do I really have to study how to call EditPlus to get the encoding it shows at the bottom?

CodePudding user response:

I opened your test files with UltraEdit and its detection failed: everything was identified as UTF-8...
Traversing the file cannot be avoided. The only way is to identify the encoding from the bytes themselves, which is slightly painful.
Quote:
UTF-8 is a variable-length byte encoding. If a character takes only one byte, its highest bit is 0. If it takes several bytes, the number of consecutive 1 bits starting from the highest bit of the first byte gives the number of bytes in the encoding, and every remaining byte begins with 10. UTF-8 can use at most 6 bytes (this is the original design; the current standard, RFC 3629, limits it to 4).
As a table:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
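The table can be turned into a small lookup on the first byte. The following is only a sketch; the name Utf8SeqLen is made up for illustration. It follows the quoted table, so it still reports 5 and 6 for the two longest forms, even though current UTF-8 (RFC 3629) never uses them.

```vb
' Hypothetical helper: sequence length implied by a lead byte, per the table above.
' Returns 0 for a byte that cannot start a sequence (10xxxxxx, &HFE, &HFF).
Private Function Utf8SeqLen(ByVal b As Byte) As Long
    If b < &H80 Then
        Utf8SeqLen = 1              ' 0xxxxxxx
    ElseIf (b And &HE0) = &HC0 Then
        Utf8SeqLen = 2              ' 110xxxxx
    ElseIf (b And &HF0) = &HE0 Then
        Utf8SeqLen = 3              ' 1110xxxx
    ElseIf (b And &HF8) = &HF0 Then
        Utf8SeqLen = 4              ' 11110xxx
    ElseIf (b And &HFC) = &HF8 Then
        Utf8SeqLen = 5              ' 111110xx (obsolete form)
    ElseIf (b And &HFE) = &HFC Then
        Utf8SeqLen = 6              ' 1111110x (obsolete form)
    Else
        Utf8SeqLen = 0              ' continuation byte or invalid value
    End If
End Function
```

For example, Utf8SeqLen(&HEC) is 3, which matches the three bytes UTF-8 uses for a Hangul syllable.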

By the range of characters they contain, your files fall into only two cases:
First, the entire file consists of letters, digits, and symbols (pure ASCII). Then there is nothing to agonize over: ANSI and UTF-8 produce the same bytes, so read it directly and treat it as UTF-8.
Second, the file contains bytes outside the ASCII range. Then read the whole file and analyze it byte by byte: when you find a byte of the form 11xxxxxx (&HC0 or greater), count its leading 1 bits to get the length of the character, then read the following bytes for that length and check whether they all begin with 10 (in the range &H80 to &HBF). If they do, the file is UTF-8; if not, simply and crudely fall back to ANSI.
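The two cases above can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested production routine: the name LooksLikeUtf8 is made up, the file is assumed to be already loaded into a Byte array, and only the structural pattern is checked (overlong forms and surrogate values are not rejected). It accepts sequences of up to 4 bytes, following RFC 3629, and treats the 5- and 6-byte forms as invalid.

```vb
' Hypothetical helper: True if every byte fits the UTF-8 lead/continuation pattern.
' Pure ASCII (and an empty array) also returns True, as in the first case above.
Private Function LooksLikeUtf8(Bytes() As Byte) As Boolean
    Dim i As Long, k As Long, n As Long, last As Long

    On Error GoTo EmptyArr          ' UBound fails on an unallocated array
    last = UBound(Bytes)
    On Error GoTo 0

    i = LBound(Bytes)
    Do While i <= last
        If Bytes(i) < &H80 Then
            n = 1                               ' ASCII byte
        ElseIf (Bytes(i) And &HE0) = &HC0 Then
            n = 2
        ElseIf (Bytes(i) And &HF0) = &HE0 Then
            n = 3
        ElseIf (Bytes(i) And &HF8) = &HF0 Then
            n = 4
        Else
            Exit Function                       ' stray 10xxxxxx or invalid lead: ANSI
        End If

        If i + n - 1 > last Then Exit Function  ' sequence cut off at end of file

        For k = 1 To n - 1
            ' every continuation byte must look like 10xxxxxx
            If (Bytes(i + k) And &HC0) <> &H80 Then Exit Function
        Next k

        i = i + n
    Loop

    LooksLikeUtf8 = True
    Exit Function

EmptyArr:
    LooksLikeUtf8 = True
End Function
```

Because EUC-KR text is made of byte pairs that are both &HA1 or greater, a real Korean ANSI file almost always breaks the pattern within the first few characters, so the function returns False for it early.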

CodePudding user response:

That is a lot of text, but the rules are very clear, the code is not complicated to write, and it will not take long to run. To improve detection accuracy, it is better to check the entire file.

CodePudding user response:

I feel your thinking may have gone astray somewhere. Try this: a domestic editor that is stronger than EditPlus.
If it can identify your files automatically, join its user group and ask the author directly in what order the checks are made. As far as I understand, this is not a technical secret in that software, so he should be willing to share:
http://www.everedit.net/

CodePudding user response:

Quoting Runnerchin's reply (5th floor):
That is a lot of text, but the rules are very clear, the code is not complicated to write, and it will not take long to run. To improve detection accuracy, it is better to check the entire file.


What you describe is also the approach most of us would think of first. In a real product, of course, fault tolerance has to be considered from every angle, but I think the full-scan step you mention is necessary.

CodePudding user response:

Posting a function that detects a web page's encoding; see if it helps the original poster.

 
Private Function IsUTF8(Bytes) As Boolean
    ' Judge whether a web page's bytes are UTF-8.
    ' Returns False for pure ASCII and for anything that is not valid
    ' 1- to 3-byte UTF-8 (4-byte sequences are not handled).
    On Error GoTo ErrHandler

    Dim I As Long, AscN As Long, Length As Long

    Length = UBound(Bytes) + 1

    If Length < 3 Then
        IsUTF8 = False
        Exit Function
    ElseIf Bytes(0) = &HEF And Bytes(1) = &HBB And Bytes(2) = &HBF Then
        ' UTF-8 byte order mark
        IsUTF8 = True
        Exit Function
    End If

    Do While I <= Length - 1
        If Bytes(I) < 128 Then
            ' Single-byte (ASCII) character
            I = I + 1
            AscN = AscN + 1
        ElseIf (Bytes(I) And &HE0) = &HC0 Then
            ' 2-byte sequence: 110xxxxx 10xxxxxx
            If I + 1 >= Length Then Exit Function
            If (Bytes(I + 1) And &HC0) <> &H80 Then Exit Function
            I = I + 2
        ElseIf (Bytes(I) And &HF0) = &HE0 Then
            ' 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
            If I + 2 >= Length Then Exit Function
            If (Bytes(I + 1) And &HC0) <> &H80 Or (Bytes(I + 2) And &HC0) <> &H80 Then Exit Function
            I = I + 3
        Else
            ' Any other byte: not UTF-8 (IsUTF8 stays False)
            Exit Function
        End If
    Loop

    ' A file made only of ASCII bytes is reported as not UTF-8
    IsUTF8 = (AscN <> Length)
    Exit Function

ErrHandler:
    ' The original called the poster's own error helper here (EM "IsUTF8"),
    ' which is not shown in the thread; treat any error as "not UTF-8".
    IsUTF8 = False
End Function

CodePudding user response:

Quoting qq_15724883's reply (8th floor):
Posting a function that detects a web page's encoding; see if it helps the original poster.

 