I read a shapefile with GeoPandas like this:
import geopandas as gpd
file = gpd.read_file('./County.shp', encoding='utf-8')
file.head()
In some cases the encoding works well. For example, without the encoding the name is read as Göttingen, but with the encoding it is Göttingen.
However, it doesn't work in all cases. For example, Gebietseinheit Kassel ohne Großstädte is read as b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'.
How can I fix this?
CodePudding user response:
\xdf is ß; likewise, \xe4 is ä:
>>> '\xdf'
'ß'
>>> '\xe4'
'ä'
So there is nothing wrong with the encodings.
The real cause is that the value is read as a bytes string, which is what the b prefix means:
>>> b'\xdf'
b'\xdf'
>>> b'\xe4'
b'\xe4'
So they're the same values; Python just displays them differently, because the repr of a bytes object shows every non-ASCII byte as a \x escape.
Additionally:
# With the b prefix:
>>> b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
# Without the b prefix:
>>> 'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
'Gebietseinheit Kassel ohne Großstädte'
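One way to check that the bytes and the characters really carry the same values is to compare them numerically. The snippet below is a small sketch using only built-ins; the variable names are my own and not from the question. It also shows that these bytes are not valid UTF-8, which is presumably why encoding='utf-8' could not turn them into a str.

```python
# Compare the raw byte values with the code points of the characters.
raw = b'Gro\xdfst\xe4dte'
text = 'Großstädte'

# Iterating over bytes yields integers; ord() gives a character's code point.
byte_values = list(raw)
code_points = [ord(ch) for ch in text]
same_values = byte_values == code_points

# Latin-1 maps each byte 0-255 to the code point with the same number,
# so encoding the text as Latin-1 gives back the original bytes.
round_trip = text.encode('latin1') == raw

# Try to decode the same bytes as UTF-8 and record whether it succeeds.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(same_values, round_trip, utf8_ok)
```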
If you want to print the string with the special characters looking normal, use bytes.decode to convert it to a str, using the latin1 encoding:
>>> bytes_str = b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
>>> bytes_str
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
>>> normal_str = bytes_str.decode('latin1')
>>> normal_str
'Gebietseinheit Kassel ohne Großstädte'
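If the column ends up holding a mix of already-decoded str values and raw bytes, as in the question, something like the following could normalize it. This is only a sketch under assumptions: to_text is a hypothetical helper, not part of GeoPandas, and the Latin-1 fallback assumes the names that fail as UTF-8 are Latin-1 encoded, as the example above suggests.

```python
def to_text(value, fallback='latin1'):
    """Return value as str, trying UTF-8 first and then a fallback encoding."""
    if not isinstance(value, bytes):
        # Already a str (or a non-text value such as None): leave it alone.
        return value
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        # With the default Latin-1 fallback this cannot fail,
        # because Latin-1 assigns a character to every byte value.
        return value.decode(fallback)

# A mixed list standing in for the values of one attribute column.
names = ['Göttingen', b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte']
cleaned = [to_text(name) for name in names]
print(cleaned)
```

With a GeoDataFrame this could be applied per column, for example file['NAME'] = file['NAME'].apply(to_text), where 'NAME' is a placeholder for the actual column name.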