In one column of my dataset there are a few odd byte entries such as b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'. Using str.decode('latin1')
fixed those, but it also breaks the normal strings, for example MyNamePlace,
turning them into null.
import geopandas as gpd

content = gpd.read_file(fpath, encoding='utf-8')
content.RKI_NameDE = content.RKI_NameDE.str.decode('latin1')
Is it possible to automatically detect which strings need decoding? How can I avoid decoding every row?
Or how can I use an if/else check, e.g. for \x
escapes, and only modify those rows?
CodePudding user response:
Only change the part of the DataFrame where the value is a byte string:
import pandas as pd

df = pd.DataFrame({'s': [b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte', 'MyNamePlace']})

# Boolean mask of rows whose value is bytes rather than str
byte_str = df['s'].map(lambda v: isinstance(v, bytes))
df.loc[byte_str, 's'] = df.loc[byte_str, 's'].str.decode('latin_1')
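Applied to the question's own column (a sketch, using a plain DataFrame with the question's RKI_NameDE column name in place of the real GeoDataFrame), the same mask idea looks like this:

```python
import pandas as pd

# Sample mimicking the question's mixed column of bytes and str values
content = pd.DataFrame({'RKI_NameDE': [b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte',
                                       'MyNamePlace']})

# Mark rows holding raw bytes, then decode only those rows
is_bytes = content['RKI_NameDE'].map(lambda v: isinstance(v, bytes))
content.loc[is_bytes, 'RKI_NameDE'] = content.loc[is_bytes, 'RKI_NameDE'].str.decode('latin1')
```

The plain-string rows are never touched, so nothing turns into null.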
CodePudding user response:
If you have a condition to check, you could write a function and use apply, returning either the decoded string or the original string when no decoding is needed.
def decode_col(x):
    # Decode only when the value is raw bytes; pass str through unchanged
    if isinstance(x.RKI_NameDE, bytes):
        return x.RKI_NameDE.decode('latin1')
    return x.RKI_NameDE

content.RKI_NameDE = content.apply(decode_col, axis=1)
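As a vectorized alternative to the row-wise apply above (a sketch on a sample frame; Series.map visits each cell directly, avoiding the per-row objects built by apply with axis=1):

```python
import pandas as pd

content = pd.DataFrame({'RKI_NameDE': [b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte',
                                       'MyNamePlace']})

# Decode bytes cells, leave str cells as-is
content['RKI_NameDE'] = content['RKI_NameDE'].map(
    lambda v: v.decode('latin1') if isinstance(v, bytes) else v
)
```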