In one column of my dataset there are a few odd byte entries such as b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'. Using str.decode('latin1')
fixed those, but it also breaks the normal strings, for example MyNamePlace,
turning them into null.
import geopandas as gpd

content = gpd.read_file(fpath, encoding='utf-8')
content.RKI_NameDE = content.RKI_NameDE.str.decode('latin1')
Is it possible to automatically detect which strings need decoding? How can I avoid decoding every row?
Or how can I use an if/else check, e.g. for \x
escapes, and only modify those rows?
CodePudding user response:
Only change the part of the DataFrame where the value is a byte string:
import pandas as pd

df = pd.DataFrame({'s': [b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte', 'MyNamePlace']})

# Boolean mask of rows whose value is bytes rather than str
byte_str = df['s'].map(lambda v: isinstance(v, bytes))
df.loc[byte_str, 's'] = df.loc[byte_str, 's'].str.decode('latin_1')
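Applied to the question's own column (a sketch, using a plain DataFrame with the question's RKI_NameDE column name in place of the real GeoDataFrame), the same mask idea looks like this:

```python
import pandas as pd

# Sample mimicking the question's mixed column of bytes and str values
content = pd.DataFrame({'RKI_NameDE': [b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte',
                                       'MyNamePlace']})

# Mark rows holding raw bytes, then decode only those rows
is_bytes = content['RKI_NameDE'].map(lambda v: isinstance(v, bytes))
content.loc[is_bytes, 'RKI_NameDE'] = content.loc[is_bytes, 'RKI_NameDE'].str.decode('latin1')
```

The plain-string rows are never touched, so nothing turns into null.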
CodePudding user response:
If you have a condition to check, you could write a function and use apply, returning either the decoded string or the original string when no decoding is needed.
def decode_col(x):
    # Decode only when the value is raw bytes; pass str through unchanged
    if isinstance(x.RKI_NameDE, bytes):
        return x.RKI_NameDE.decode('latin1')
    return x.RKI_NameDE

content.RKI_NameDE = content.apply(decode_col, axis=1)
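As a vectorized alternative to the row-wise apply above (a sketch on a sample frame; Series.map visits each cell directly, avoiding the per-row objects built by apply with axis=1):

```python
import pandas as pd

content = pd.DataFrame({'RKI_NameDE': [b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte',
                                       'MyNamePlace']})

# Decode bytes cells, leave str cells as-is
content['RKI_NameDE'] = content['RKI_NameDE'].map(
    lambda v: v.decode('latin1') if isinstance(v, bytes) else v
)
```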