Using pandas in a Jupyter notebook, I would like to remove everything that is not a letter: hyphens, special characters, and so on.
Example:
firstname,birthday_date
joe-down§,02-12-1990
lucash brown_ :),06-09-1980
^antony,11-02-1987
mary|,14-12-2002
should become:
firstname,birthday_date
joe down,02-12-1990
lucash brown,06-09-1980
antony,11-02-1987
mary,14-12-2002
I'm trying with:
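# regex=False treats each pattern as a literal character; '^', '|' and ')' are regex metacharacters
df['firstname'] = df['firstname'].str.replace('!', '', regex=False)
df['firstname'] = df['firstname'].str.replace('^', '', regex=False)
df['firstname'] = df['firstname'].str.replace('|', '', regex=False)
df['firstname'] = df['firstname'].str.replace('§', '', regex=False)
df['firstname'] = df['firstname'].str.replace(':', '', regex=False)
df['firstname'] = df['firstname'].str.replace(')', '', regex=False)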
......
......
df
It seems to work, but on more populated columns I always miss some characters. Is there a way to remove all non-text characters at once and keep only the word or words in the same column? I used firstname in the example to illustrate the idea, but the solution should also work for columns containing whole phrases.
Thanks!
P.S. It should also remove the encoded text for emoticons.
CodePudding user response:
Try the code below. It works on the names used in your post.
first_names = ['joe-down§','lucash brown_','^antony','mary|']
clean_names = []
keep = {'-',' '}
for name in first_names:
    # keep letters and digits, turn '-' and ' ' into a space, drop everything else
    clean_names.append(''.join(c if c not in keep else ' ' for c in name if c.isalnum() or c in keep))
print(clean_names)
Output:
['joe down', 'lucash brown', 'antony', 'mary']
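Since the question is about a DataFrame column rather than a plain list, here is a minimal sketch of how the same character filter might be applied with `Series.map`. The names `clean_name` and `KEEP_AS_SPACE` are made up for illustration, and the sample rows are the ones from the question. The extra `split`/`join` step is an addition that collapses runs of spaces and trims the ends.

```python
import pandas as pd

# Characters that are not dropped but turned into a space (assumed set, as in the loop above)
KEEP_AS_SPACE = {"-", " "}


def clean_name(name: str) -> str:
    """Keep letters and digits, map '-' and ' ' to a space, drop everything else."""
    chars = (
        " " if c in KEEP_AS_SPACE else c
        for c in name
        if c.isalnum() or c in KEEP_AS_SPACE
    )
    # split() + join() collapses repeated spaces and strips leading/trailing ones
    return " ".join("".join(chars).split())


df = pd.DataFrame(
    {
        "firstname": ["joe-down§", "lucash brown_ :)", "^antony", "mary|"],
        "birthday_date": ["02-12-1990", "06-09-1980", "11-02-1987", "14-12-2002"],
    }
)

# Apply the cleaner to every value in the column
df["firstname"] = df["firstname"].map(clean_name)
print(df["firstname"].tolist())
```

Note that `str.isalnum()` is Unicode-aware, so accented letters are kept, and digits are kept as well.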
CodePudding user response:
You can use regex for this.
df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
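The character class `[a-zA-Z0-9]` only covers ASCII, so names with accented letters would lose characters. As a possible variation (a sketch, not the answerer's exact method), the pattern `[^\w\s]|_` removes anything that is neither a word character nor whitespace, plus the underscore, and a second pass collapses repeated spaces. The last two sample rows are invented to show accented letters and an emoticon.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "firstname": [
            "joe-down§",
            "lucash brown_ :)",
            "^antony",
            "mary|",
            "josé-maría",  # invented row: accented letters should survive
            "zoë 😀",      # invented row: emoticon should be removed
        ]
    }
)

df["firstname"] = (
    df["firstname"]
    # replace anything that is not a word character or whitespace (and '_') with a space
    .str.replace(r"[^\w\s]|_", " ", regex=True)
    # collapse runs of whitespace into a single space
    .str.replace(r"\s+", " ", regex=True)
    # trim leading and trailing spaces
    .str.strip()
)
print(df["firstname"].tolist())
```

Like the original pattern, this keeps digits; to drop them too, `\d` could be added to the alternation (`[^\w\s]|[_\d]`).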