How to tell Python that a string is actually a bytes object? Not converting

I have a txt file which contains a line:

 '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'

The content inside the double quotes is actually a sequence of octal-escaped bytes, but with doubled backslashes.

After the line has been read in, I used regex to extract the contents in the double quotes.

c = re.search(r': "(.+)"', line).group(1)

After that, I have two problems:

First, I need to replace the doubled backslashes with single ones.

Second, I need to tell Python that the str object c is actually a bytes object.

I have not managed to do either.

I have tried:

re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)

All failed.

A bytes object can easily be defined with the 'b' prefix:

c = b'\351\231\220\346\227\266\345\205\215\350\264\271'

How do I change the type of a variable from str to bytes? I think this is not an encode-and-decode thing.

I googled a lot but found no answers. Maybe I used the wrong keywords.

Does anyone know how to do this, or another way to get what I want?

CodePudding user response：

This is always a little confusing. I assume your bytes object should represent a string like:

b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'

To get that with your escaped string, you could use the codecs library and try:

import re
import codecs

line = '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)

codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'

giving the same result.
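As far as I can tell, codecs.escape_decode is not part of the documented codecs API, so a fallback that uses only documented calls may be worth having. The following is a minimal sketch, assuming every escape is a three-digit octal value no larger than \377 and that the bytes are UTF-8. The helper name decode_octal_escapes is hypothetical.

```python
import re


def decode_octal_escapes(escaped, encoding="utf-8"):
    # Work on bytes so each escape can be replaced by exactly one byte.
    raw = escaped.encode(encoding)
    data = re.sub(
        rb"\\([0-3][0-7]{2})",                  # a backslash plus three octal digits
        lambda m: bytes([int(m.group(1), 8)]),  # the byte with that octal value
        raw,
    )
    return data.decode(encoding)


c = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
print(decode_octal_escapes(c))  # 限时免费
```

Text without escapes passes through unchanged, because the substitution only touches backslash-plus-octal sequences.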

CodePudding user response：

In Python, a string is a sequence of Unicode characters. If a literal in your source code represents a sequence of bytes, you can make it a bytes object by prefixing it with b.

For example:

byte_string = b'This is a byte string.'
print(type(byte_string))
# Output: <class 'bytes'>

In this example, the b prefix indicates that the literal is a sequence of bytes, and the type of byte_string is bytes. Note that escape sequences such as \n are still interpreted inside a bytes literal (b'\n' is a single newline byte), but only ASCII characters may appear literally; any other byte must be written as an escape such as \351 or \xe9.

It's important to note that a bytes object is different from a str (string) object in terms of encoding, representation and processing. So, marking a string as a bytes object is important when you are dealing with binary data, such as reading from a binary file or sending data over a network connection.
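The b prefix applies only to literals typed into source code. A str that already exists in a variable has to go through encode() to become bytes. A small sketch of the difference, with illustrative variable names:

```python
text = "限时免费"               # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: the UTF-8 encoding of that text

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 12

# Decoding with the same codec recovers the original text.
assert data.decode("utf-8") == text

# The octal escapes in a bytes literal denote the same byte values.
assert data == b"\351\231\220\346\227\266\345\205\215\350\264\271"
```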

CodePudding user response：

The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
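To see what "literal text for escape codes" means, it can help to compare the repr of such a string with what print shows. The doubled backslash exists only in the repr; the string itself holds one backslash followed by three digits. A brief illustration:

```python
s = "\\351"          # source-code spelling of the four characters \351

print(repr(s))       # '\\351'  (repr doubles the backslash)
print(s)             # \351     (the actual characters)
print(len(s))        # 4

# By contrast, the escape code itself denotes a single character.
print(len("\351"))   # 1
```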

Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
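The 1:1 property is easy to check: under Latin-1, code points 0 through 255 map to bytes of the same value, and back again. A quick check along those lines:

```python
chars = "".join(chr(i) for i in range(256))
data = chars.encode("latin-1")

# Every code point 0..255 becomes the byte with the same value.
assert data == bytes(range(256))
# The round trip is lossless.
assert data.decode("latin-1") == chars

# Anything above U+00FF has no Latin-1 byte and raises an error.
try:
    "限".encode("latin-1")
except UnicodeEncodeError as exc:
    print("not encodable:", exc.reason)
```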

Step-by-Step:

>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s)  # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1')  # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
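The four steps can be collected into one helper. This is only a sketch: it assumes the input holds ASCII text plus backslash escapes and that the escaped bytes are UTF-8. The name unescape_utf8 is made up for illustration.

```python
def unescape_utf8(s):
    escaped_bytes = s.encode("latin-1")                   # str -> bytes, 1:1
    latin1_text = escaped_bytes.decode("unicode-escape")  # interpret \ooo escapes
    raw = latin1_text.encode("latin-1")                   # back to the real bytes
    return raw.decode("utf-8")                            # decode as UTF-8 text


s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
print(unescape_utf8(s))  # 限时免费
```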

Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary text directly into a Python object, and then fixing only the affected value:

>>> # Python dictionary-like text
>>> d = '{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d)  # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
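If the file is not a complete dictionary, the same idea can be applied to a single line by evaluating only the quoted part. A sketch, assuming each line looks like the one in the question and the value is a double-quoted literal:

```python
import ast
import re

line = '        6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'

# Capture the value including its surrounding double quotes.
quoted = re.search(r': (".+")', line).group(1)

# literal_eval interprets the octal escapes, yielding Latin-1-looking text.
mojibake = ast.literal_eval(quoted)

# Undo the Latin-1 interpretation and decode the bytes as UTF-8.
text = mojibake.encode("latin-1").decode("utf-8")
print(text)  # 限时免费
```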