encoding utf-8 doesn't work with all German characters


I read a geo pandas file like this:

import geopandas as gpd

file = gpd.read_file('./County.shp', encoding='utf-8')
file.head()

In some cases the encoding works well. For example, without the encoding the name comes out garbled, but with the encoding it is read correctly as Göttingen.

However, it doesn't work in all cases. For example, Gebietseinheit Kassel ohne Großstädte is read as b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'.

How can I fix this?

CodePudding user response:

\xdf is ß; likewise, \xe4 is ä:

>>> '\xdf'
'ß'

>>> '\xe4'
'ä'

So there is nothing wrong with the encodings.
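If you want to double-check which characters those escapes stand for, the standard library's unicodedata module can name them (a quick sanity check, not strictly required):

>>> import unicodedata
>>> unicodedata.name('\xdf')
'LATIN SMALL LETTER SHARP S'
>>> unicodedata.name('\xe4')
'LATIN SMALL LETTER A WITH DIAERESIS'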

Rather, it's because the value is read in as a bytes object, which is what the b prefix indicates:

>>> b'\xdf'
b'\xdf'

>>> b'\xe4'
b'\xe4'

So they hold the same values; Python simply displays bytes and str objects differently.
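A minimal check (using the Latin-1 decoding discussed further below) confirms that the bytes and str versions carry the same value:

>>> b'\xdf'.decode('latin1') == '\xdf'
True
>>> b'\xe4'.decode('latin1') == '\xe4'
True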

Additionally:

# With the b prefix:
>>> b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'

# Without the b prefix:
>>> 'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
'Gebietseinheit Kassel ohne Großstädte'

If you want to print the string with the special characters rendered properly, use bytes.decode to convert it to a str, using the Latin-1 encoding:

>>> bytes_str = b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
>>> bytes_str
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'

>>> normal_str = bytes_str.decode('latin1')
>>> normal_str
'Gebietseinheit Kassel ohne Großstädte'
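To apply this fix to the whole GeoDataFrame rather than a single value, one option is to decode any remaining bytes values in the string columns after reading the file. The sketch below makes a few assumptions: the path mirrors the question, Latin-1 follows the example above, and depending on how the shapefile's .dbf was written, simply passing encoding='latin-1' (or 'cp1252') to gpd.read_file may solve it at the source instead.

import geopandas as gpd

gdf = gpd.read_file('./County.shp')

def decode_bytes(value, encoding='latin-1'):
    # Decode only values that are still raw bytes; leave str values untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Apply to every string-like (object) column; the geometry column is not affected.
for col in gdf.select_dtypes(include='object').columns:
    gdf[col] = gdf[col].apply(decode_bytes)

gdf.head()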