After some interim DataFrame transformations, I need a safety mechanism that removes random special characters from the complete pandas DataFrame. What would be the best way to do this?
Example DF:
data = {'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'], 'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]}
What would be the best way to wipe off any random special characters, keeping only ASCII? A lambda function run on every row and every column, or is there a better way? Instead of running

df['Name'] = df['Name'].str.encode('ascii', 'ignore').str.decode('ascii')

on every individual column, is there any particular way to do it at the global, all-columns level?
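For concreteness, this is roughly what I run today, column by column (a sketch; the astype(str) coercion is something I add because the Age column holds mixed types):

import pandas as pd

df = pd.DataFrame(data)  # data as defined above

# today: repeat this line for each column, one at a time
df['Name'] = df['Name'].astype(str).str.encode('ascii', 'ignore').str.decode('ascii')
df['Age'] = df['Age'].astype(str).str.encode('ascii', 'ignore').str.decode('ascii')
# ...and so on for every other column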
Expected resulting DF:

    Name     ID
0    Tom   20FE
1   nick   21BG
2  krish  192RF
3   jack   18XR
CodePudding user response:
Try:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'],
    'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]
})

which gives:

      Name   Age
0      Tom  20®®
1    ni�ck    21
2  kri�sh�  1®®9
3   ja®®ck    18

Then keep only the alphanumeric characters in every cell:

def keep_alnum(x):
    # str(x) first, because some cells hold ints; then drop anything non-alphanumeric
    return ''.join(ch for ch in str(x) if ch.isalnum())

for col in df.columns:
    df[col] = df[col].apply(keep_alnum)

Output:

    Name Age
0    Tom  20
1   nick  21
2  krish  19
3   jack  18
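The same element-wise filter can also be applied to the whole frame in a single call, which addresses the "global columns level" part of the question. A minimal sketch (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1, so the method name depends on your version):

# df and keep_alnum as defined above
df_clean = df.map(keep_alnum)         # pandas >= 2.1
# df_clean = df.applymap(keep_alnum)  # older pandas versions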
Here is a link explaining why the string method is preferable compared to a regex one.
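As a rough way to check that comparison yourself, here is a small timing sketch (the regex is one plausible ASCII-only equivalent, not necessarily the one from the link, and the two filters differ on non-ASCII letters; numbers will vary by machine):

import re
import timeit

s = 'ja®\u00AEck' * 1_000  # a long string with some non-alphanumeric noise

rx = re.compile(r'[^0-9a-zA-Z]+')

print('isalnum filter:', timeit.timeit(lambda: ''.join(ch for ch in s if ch.isalnum()), number=200))
print('regex sub     :', timeit.timeit(lambda: rx.sub('', s), number=200))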
CodePudding user response:
Credit goes to @not a robot!
(If you turn your comment into an answer, and I really think it deserves to be one, I'll delete mine and upvote yours!)
A very interesting and useful approach to get rid of all the non-ASCII characters using a regex:
df.replace('[^ -~]+', '', regex=True)
    Name Age
0    Tom  20
1   nick  21
2  krish  19
3   jack  18
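For anyone puzzled by the pattern: [^ -~] matches any character outside the printable ASCII range (space, 0x20, through tilde, 0x7E), and the + collapses each run of such characters into one replacement. The same thing written with explicit escapes, as a self-contained sketch:

import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'],
                   'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]})

# \x20-\x7e spells out the printable ASCII range; non-string cells are left untouched
print(df.replace(r'[^\x20-\x7e]+', '', regex=True))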