Pandas remove non-alphanumeric characters from string column

Time:10-01

Using pandas in a Jupyter notebook, I would like to delete everything that is not a letter: hyphens, special characters, and so on.

Example:

firstname,birthday_date
joe-down§,02-12-1990
lucash brown_ :),06-09-1980
^antony,11-02-1987
mary|,14-12-2002

should become:

firstname,birthday_date
joe down,02-12-1990
lucash brown,06-09-1980
antony,11-02-1987
mary,14-12-2002

This is what I have tried:

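# regex=False treats each pattern as a literal character; '^', '|' and ')' are regex metacharacters
df['firstname'] = df['firstname'].str.replace('!', '', regex=False)
df['firstname'] = df['firstname'].str.replace('^', '', regex=False)
df['firstname'] = df['firstname'].str.replace('|', '', regex=False)
df['firstname'] = df['firstname'].str.replace('§', '', regex=False)
df['firstname'] = df['firstname'].str.replace(':', '', regex=False)
df['firstname'] = df['firstname'].str.replace(')', '', regex=False)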

......
......
df

It seems to work, but on more populated columns I always miss some characters. Is there a way to remove all non-text characters and keep only the word or words in the same column? I used firstname in the example to illustrate the idea, but the solution should also work for columns that contain whole phrases.

Thanks!

P.S. Encoded text for emoticons should be removed as well.

CodePudding user response:

Try the code below. It works on the names used in your post.

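first_names = ['joe-down§','lucash brown_','^antony','mary|']
clean_names = []
# characters in `keep` survive the filter and are then turned into a space
keep = {'-',' '}
for name in first_names:
    # drop every character that is neither alphanumeric nor in `keep`
    clean_names.append(''.join(c if c not in keep else ' ' for c in name if c.isalnum() or c in keep))
print(clean_names)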

Output:

['joe down', 'lucash brown', 'antony', 'mary']
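
The loop above can also be wrapped in a small helper, which makes it easier to reuse on a whole column. This is only a sketch: the name clean_name and its to_space parameter are hypothetical, and collapsing repeated spaces is an added assumption so that an input such as 'lucash brown_ :)' does not end with a trailing space.

```python
def clean_name(name, to_space=frozenset({'-'})):
    """Keep letters, digits and whitespace; turn chosen separators into spaces."""
    chars = []
    for c in name:
        if c in to_space:
            chars.append(' ')      # separators such as '-' become a space
        elif c.isalnum() or c.isspace():
            chars.append(c)        # letters, digits and whitespace are kept
        # every other character is dropped
    # split() + join() collapses runs of whitespace and trims both ends
    return ' '.join(''.join(chars).split())

first_names = ['joe-down§', 'lucash brown_ :)', '^antony', 'mary|']
clean_names = [clean_name(n) for n in first_names]
print(clean_names)
```

With pandas, a helper like this could be applied through df['firstname'].map(clean_name), assuming the column contains only strings. Note that str.isalnum() is Unicode-aware, so accented letters are kept.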

CodePudding user response:

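You can use a regular expression for this: replace every character that is not an ASCII letter or digit with a space, then strip the leading and trailing spaces.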

df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
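
One thing to watch with a pattern like this is that several special characters in a row inside a value each become their own space. A possible variation, sketched here with the standard re module (which pandas uses for regex=True), adds a + quantifier so a whole run collapses into one space. The names ASCII_ONLY, UNICODE_AWARE and clean are hypothetical, and the accented sample is made up for illustration.

```python
import re

# A run of anything that is not an ASCII letter or digit becomes one space.
ASCII_ONLY = re.compile(r'[^a-zA-Z0-9]+')
# \W matches non-word characters; adding _ removes underscores too.
# For str patterns in Python 3, word characters include Unicode letters.
UNICODE_AWARE = re.compile(r'[\W_]+')

def clean(text, pattern=ASCII_ONLY):
    """Replace each run of unwanted characters with one space, then trim."""
    return pattern.sub(' ', text).strip()

samples = ['joe-down§', 'lucash brown_ :)', '^antony', 'mary|', 'josé--maría']
ascii_result = [clean(s) for s in samples]
unicode_result = [clean(s, UNICODE_AWARE) for s in samples]
print(ascii_result)
print(unicode_result)
```

The ASCII pattern splits accented names apart, while the Unicode-aware one keeps them whole. In pandas the same idea would be df['firstname'].str.replace(r'[\W_]+', ' ', regex=True).str.strip().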