I have a dataframe:
{'country': {0: 'Afghanistan?*', 1: 'Albania?*'},
'region': {0: 'Asia', 1: 'Europe'},
'subregion': {0: 'Southern Asia', 1: 'Southern Europe'},
'rate_per_1000': {0: 6.7, 1: 2.1},
'count': {0: '2,474', 1: '61'},
'year': {0: 2018, 1: 2020},
'source': {0: 'NSO', 1: 'NSO'}}
         country  region        subregion  rate_per_1000  count  year  source
0  Afghanistan?*    Asia    Southern Asia            6.7  2,474  2018     NSO
1      Albania?*  Europe  Southern Europe            2.1     61  2020     NSO
There are several unwanted characters here that I want to get rid of. I wrote a short function to use with .apply(), but it loops over a predefined list of bad characters. That feels like a code smell to me; I think this operation could be vectorized in some way. This is what I've tried:
bad_chars = ['?', '*', ',']

def string_cleaner(col):
    if col.dtype == 'object':
        for char in bad_chars:
            col = col.str.replace(f'{char}', '')
        return col

homicide_by_country = homicide_by_country.apply(string_cleaner)
homicide_by_country
       country  region        subregion  rate_per_1000  count  year  source
0  Afghanistan    Asia    Southern Asia           None   2474  None     NSO
1      Albania  Europe  Southern Europe           None     61  None     NSO
I'm looking for a more pythonic/pandonic way to achieve the same result.
edit: You may notice that, for some reason, my rate_per_1000 column goes blank. I haven't troubleshot that problem yet, but if you spot something obvious I'm all ears.
CodePudding user response:
Seems like you need df.replace with regex=True:
import re
>>> df.replace('|'.join(map(re.escape, bad_chars)),'', regex=True)
Notice that this will keep the same dtypes of your columns, so no need to worry about numeric cols.
Also, note that your regex needs special treatment because ?, *, etc. are special characters in regular expressions, so you need to escape them.
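As a minimal sketch of the escaping point above, the snippet below rebuilds a cut-down version of the sample frame from the question (the column subset and the names `pattern` and `cleaned` are only for illustration) and applies the escaped alternation with df.replace. It assumes a reasonably recent pandas; the numeric columns should pass through untouched.

```python
import re
import pandas as pd

# Cut-down copy of the question's data (illustrative subset of columns)
df = pd.DataFrame({
    'country': ['Afghanistan?*', 'Albania?*'],
    'rate_per_1000': [6.7, 2.1],
    'count': ['2,474', '61'],
    'year': [2018, 2020],
})
bad_chars = ['?', '*', ',']

# re.escape makes regex metacharacters such as '?' and '*' match literally;
# joining with '|' builds an alternation over all bad characters
pattern = '|'.join(map(re.escape, bad_chars))

# With regex=True, the pattern is applied to string values only;
# df.replace returns a new frame and leaves df itself unchanged
cleaned = df.replace(pattern, '', regex=True)
print(cleaned)
print(cleaned.dtypes)
```

Without re.escape, a pattern such as '?|*|,' is not a valid regular expression, because ? and * are quantifiers with nothing to repeat.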
CodePudding user response:
You can use select_dtypes to target your replacement, then update:
import re
bad_chars = ['?', '*', ',']
reg = f'[{"".join(map(re.escape, bad_chars))}]'
df.update(df
    .select_dtypes(object)
    .apply(lambda c: c.str.replace(reg, '', regex=True))
)
print(df)
NB. The modification is done in place.
Output:
       country  region        subregion  rate_per_1000  count  year  source
0  Afghanistan    Asia    Southern Asia            6.7   2474  2018     NSO
1      Albania  Europe  Southern Europe            2.1     61  2020     NSO
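To illustrate the in-place note, here is a small sketch under a couple of assumptions: the string columns are stored with the object dtype (forced explicitly below, since newer pandas versions may infer a dedicated string dtype), and the names `original` and `result` are hypothetical helpers that are not part of the answer. It shows that update returns None and mutates df, so a copy is needed if the raw data should be kept.

```python
import re
import pandas as pd

# Cut-down copy of the question's data; object dtype is forced so that
# select_dtypes(object) picks up the string columns (assumption)
df = pd.DataFrame({
    'country': pd.Series(['Afghanistan?*', 'Albania?*'], dtype=object),
    'rate_per_1000': [6.7, 2.1],
    'count': pd.Series(['2,474', '61'], dtype=object),
    'year': [2018, 2020],
})
bad_chars = ['?', '*', ',']

# Character class matching any single bad character
reg = f'[{"".join(map(re.escape, bad_chars))}]'

# Keep an untouched copy, because update() modifies df itself
original = df.copy()

# update() returns None; the cleaned values are written back into df
result = df.update(
    df.select_dtypes(object)
      .apply(lambda c: c.str.replace(reg, '', regex=True))
)
print(result)
print(df)
```

Because only the object columns are selected, rate_per_1000 and year are never passed through the string methods, which is why they keep their values here.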