Vectorized str.replace for multiple characters in pandas


I have a dataframe:

 {'country': {0: 'Afghanistan?*', 1: 'Albania?*'},
 'region': {0: 'Asia', 1: 'Europe'},
 'subregion': {0: 'Southern Asia', 1: 'Southern Europe'},
 'rate_per_1000': {0: 6.7, 1: 2.1},
 'count': {0: '2,474', 1: '61'},
 'year': {0: 2018, 1: 2020},
 'source': {0: 'NSO', 1: 'NSO'}}

          country  region        subregion  rate_per_1000  count  year source
0   Afghanistan?*    Asia    Southern Asia            6.7  2,474  2018    NSO
1       Albania?*  Europe  Southern Europe            2.1     61  2020    NSO

There are multiple bad characters here that I want to get rid of. I made a short function to use with .apply() to remove them, but I am looping over a defined list of bad characters. That gives off a code smell to me; I suspect this operation could be vectorized somehow. This is what I've tried:

bad_chars = ['?', '*', ',']

def string_cleaner(col):
    if col.dtype == 'object':
        for char in bad_chars:
            col = col.str.replace(f'{char}', '')
        return col

homicide_by_country = homicide_by_country.apply(string_cleaner)
homicide_by_country
        country  region        subregion rate_per_1000 count  year source
0   Afghanistan    Asia    Southern Asia          None  2474  None    NSO
1       Albania  Europe  Southern Europe          None    61  None    NSO

My desired outcome is a more pythonic/pandonic technique for accomplishing the same result.

edit: You may notice that for some reason my rate_per_1000 and year columns go blank. I haven't troubleshot that problem yet, but if you spot something obvious I'm all ears.

CodePudding user response:

Seems like you need df.replace with regex=True:

import re

# join the escaped bad characters into one alternation pattern and strip every match
df.replace('|'.join(map(re.escape, bad_chars)), '', regex=True)

Notice that this will keep the dtypes of your columns, so there is no need to worry about numeric columns. (Incidentally, that is likely also why your rate_per_1000 and year columns went blank: string_cleaner only returns a value for object columns, so every other column implicitly gets None back from apply.)
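
As a quick check, here is a minimal, self-contained sketch built from the question's data (the imports and the out name are my additions):

import re
import pandas as pd

bad_chars = ['?', '*', ',']
df = pd.DataFrame({'country': {0: 'Afghanistan?*', 1: 'Albania?*'},
                   'region': {0: 'Asia', 1: 'Europe'},
                   'subregion': {0: 'Southern Asia', 1: 'Southern Europe'},
                   'rate_per_1000': {0: 6.7, 1: 2.1},
                   'count': {0: '2,474', 1: '61'},
                   'year': {0: 2018, 1: 2020},
                   'source': {0: 'NSO', 1: 'NSO'}})

out = df.replace('|'.join(map(re.escape, bad_chars)), '', regex=True)
print(out.dtypes)  # rate_per_1000 stays float64, year stays int64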

Also note that the pattern needs special treatment: characters like ? and * are special in regular expressions, so they have to be escaped, which is what re.escape does.
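
For instance, the escaped, joined pattern ends up looking like this (a small sketch):

import re

bad_chars = ['?', '*', ',']
print('|'.join(map(re.escape, bad_chars)))  # prints \?|\*|, on Python 3.7+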

CodePudding user response:

You can use select_dtypes to target your replacement, then update:

import re

bad_chars = ['?', '*', ',']
reg = f'[{"".join(map(re.escape, bad_chars))}]'  # character class: [\?\*,]
df.update(df
    .select_dtypes(object)                       # only the string columns
    .apply(lambda c: c.str.replace(reg, '', regex=True))
)

print(df)

NB. The update is done in place (df itself is modified).

Output:

       country  region        subregion  rate_per_1000 count  year source
0  Afghanistan    Asia    Southern Asia            6.7  2474  2018    NSO
1      Albania  Europe  Southern Europe            2.1    61  2020    NSO
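
If you'd rather leave df untouched, the same idea works on a copy (a sketch; out is my name, reg is the pattern defined above):

out = df.copy()
out.update(out
    .select_dtypes(object)
    .apply(lambda c: c.str.replace(reg, '', regex=True))
)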