After some interim DataFrame transformations, I need a safety mechanism that removes random special characters from the complete pandas DataFrame. What would be the best way to do this?
Example DF:
data = {'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'], 'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]}
What would be the best way to wipe off any random special characters, keeping only ASCII? A lambda function run on every row and every column, or is there a better way? Instead of running

df['Name'] = df['Name'].str.encode('ascii', 'ignore').str.decode('ascii')

on every individual column, is there any particular way to do it at the global, all-columns level?
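For concreteness, this is roughly what I run today, column by column (a sketch; the astype(str) coercion is something I add because the Age column holds mixed types):

import pandas as pd

df = pd.DataFrame(data)  # data as defined above

# today: repeat this line for each column, one at a time
df['Name'] = df['Name'].astype(str).str.encode('ascii', 'ignore').str.decode('ascii')
df['Age'] = df['Age'].astype(str).str.encode('ascii', 'ignore').str.decode('ascii')
# ...and so on for every other column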
Expected resulting DF:

    Name     ID
0    Tom   20FE
1   nick   21BG
2  krish  192RF
3   jack   18XR
CodePudding user response:
Try:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'],
    'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]
})

which gives:

      Name   Age
0      Tom  20®®
1    ni�ck    21
2  kri�sh�  1®®9
3   ja®®ck    18

Then keep only the alphanumeric characters in every cell:

def keep_alnum(x):
    # str(x) first, because some cells hold ints; then drop anything non-alphanumeric
    return ''.join(ch for ch in str(x) if ch.isalnum())

for col in df.columns:
    df[col] = df[col].apply(keep_alnum)

Output:

    Name Age
0    Tom  20
1   nick  21
2  krish  19
3   jack  18
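The same element-wise filter can also be applied to the whole frame in a single call, which addresses the "global columns level" part of the question. A minimal sketch (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1, so the method name depends on your version):

# df and keep_alnum as defined above
df_clean = df.map(keep_alnum)         # pandas >= 2.1
# df_clean = df.applymap(keep_alnum)  # older pandas versions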
Here is a link explaining why the string method is preferable compared to a regex one.
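As a rough way to check that comparison yourself, here is a small timing sketch (the regex is one plausible ASCII-only equivalent, not necessarily the one from the link, and the two filters differ on non-ASCII letters; numbers will vary by machine):

import re
import timeit

s = 'ja®\u00AEck' * 1_000  # a long string with some non-alphanumeric noise

rx = re.compile(r'[^0-9a-zA-Z]+')

print('isalnum filter:', timeit.timeit(lambda: ''.join(ch for ch in s if ch.isalnum()), number=200))
print('regex sub     :', timeit.timeit(lambda: rx.sub('', s), number=200))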
CodePudding user response:
Credit goes to @not a robot!
(If you turn your comment into an answer, and I really think it deserves to be one, I'll delete mine and upvote yours!)
A very interesting and useful approach to get rid of all the non-ASCII characters using a regex:
df.replace('[^ -~]+', '', regex=True)
    Name Age
0    Tom  20
1   nick  21
2  krish  19
3   jack  18
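For anyone puzzled by the pattern: [^ -~] matches any character outside the printable ASCII range (space, 0x20, through tilde, 0x7E), and the + collapses each run of such characters into one replacement. The same thing written with explicit escapes, as a self-contained sketch:

import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'],
                   'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]})

# \x20-\x7e spells out the printable ASCII range; non-string cells are left untouched
print(df.replace(r'[^\x20-\x7e]+', '', regex=True))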