Removing random special characters from complete dataframe

Time:06-10

After some intermediate DataFrame transformations, we need a safety mechanism that removes random special characters from an entire pandas DataFrame. What would be the best way to do this?

Example DF: data = {'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'], 'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]}

What would be the best way to strip out any special characters other than ASCII characters? A lambda function run on every row and every column, or is there a better way? Instead of running df['Name'] = df['Name'].str.encode('ascii', 'ignore').str.decode('ascii') on each individual column, is there a way to do it across all columns at once?

Expected resulting DF:

       Name       ID
0      Tom      20FE
1      nick     21BG
2      krish   192RF
3      jack     18XR

CodePudding user response:

Try:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'ni�ck', 'kri�sh�', 'ja®\u00AEck'],
    'Age': ['20®\u00AE', 21, '1®\u00AE9', 18]
})

      Name   Age
0      Tom  20®®
1    ni�ck    21
2  kri�sh�  1®®9
3   ja®®ck    18

def keep_alnum(x):
    # Convert the value to str and keep only its alphanumeric characters
    return ''.join(ch for ch in str(x) if ch.isalnum())

for col in df.columns:
    df[col] = df[col].apply(keep_alnum)

Output:

    Name Age
0    Tom  20
1   nick  21
2  krish  19
3   jack  18

Here is a link explaining why the string method is preferable to a regex one.
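
One caveat with the loop above: str.isalnum() also accepts non-ASCII letters and digits (for example 'é'), it drops spaces and punctuation, and str(x) turns every value into a string. If the goal is strictly ASCII, a variant along the following lines may be closer to what the question asks for. This is only a sketch; the names keep_ascii_alnum and cleaned are made up for illustration, and str.isascii() needs Python 3.7 or later.

```python
import pandas as pd

def keep_ascii_alnum(value):
    """Keep only ASCII letters and digits; leave non-string values untouched."""
    if not isinstance(value, str):
        return value
    return ''.join(ch for ch in value if ch.isascii() and ch.isalnum())

# Same sample data as above, written with escapes for the special characters
df = pd.DataFrame({
    'Name': ['Tom', 'ni\ufffdck', 'kri\ufffdsh\ufffd', 'ja\u00ae\u00aeck'],
    'Age': ['20\u00ae\u00ae', 21, '1\u00ae\u00ae9', 18],
})

# Work on a copy so the original frame stays available for comparison
cleaned = df.copy()
for col in cleaned.columns:
    cleaned[col] = cleaned[col].map(keep_ascii_alnum)

print(cleaned)
```

Because non-string values are passed through unchanged, the integers in Age stay integers, while the cleaned entries remain strings.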

CodePudding user response:

Credit goes to @not a robot!

(If you turn your comment into an answer - and I really think it deserves to be one -, I'll delete mine and upvote yours!)

A very interesting and useful approach to get rid of all the non-ASCII characters using a regex:

df.replace('[^ -~]+', '', regex=True)

    Name Age
0    Tom  20
1   nick  21
2  krish  19
3   jack  18
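
For reference, here is a self-contained sketch of the regex approach. The character class [^ -~] matches anything outside the printable ASCII range (space, 0x20, through tilde, 0x7E). DataFrame.replace returns a new DataFrame rather than modifying df in place, so the result has to be assigned. The variable name cleaned and the final pd.to_numeric step are additions for illustration, assuming the cleaned Age values should end up numeric.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'ni\ufffdck', 'kri\ufffdsh\ufffd', 'ja\u00ae\u00aeck'],
    'Age': ['20\u00ae\u00ae', 21, '1\u00ae\u00ae9', 18],
})

# Remove every run of characters outside printable ASCII;
# only string values are affected, other values are left as they are.
cleaned = df.replace(r'[^ -~]+', '', regex=True)

# Optional: turn the mixed str/int Age column back into numbers
cleaned['Age'] = pd.to_numeric(cleaned['Age'])

print(cleaned)
```

Unlike the isalnum() loop, this keeps spaces and ASCII punctuation, which may or may not be what you want.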