Removing unwanted characters in python pandas-CodePudding

I have a pandas dataframe column like below :

  |  ColumnA    |
   ------------- 
  |    ABCD(!)  |
  | <DEFG>(23)  |
  | (MNPQ. )    |
  |    32.JHGF  |
  |    "QWERT"  |

Aim is to remove the special characters and produce the output as below :

  |  ColumnA   |
   ------------ 
  |    ABCD    |
  |    DEFG    |
  |    MNPQ    |
  |    JHGF    |
  |    QWERT   |

Tried using the replace method like below, but without success :

df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z\d\_] ", "", regex=True)
print(df)

So, how can I replace the special characters using replace method in pandas?

CodePudding user response：

Your solution is also for get numbers \d and _, so it remove only:

df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z] ", "")
print (df)
  ColumnA
0    ABCD
1    DEFG
2    MNPQ
3    JHGF
4   QWERT

CodePudding user response：

regrex should be r'[^a-zA-Z] ', it means keep only the characters that are from A to Z, a-z

import pandas as pd

#   |  ColumnA    |
#    ------------- 
#   |    ABCD(!)  |
#   | <DEFG>(23)  |
#   | (MNPQ. )    |
#   |    32.JHGF  |
#   |    "QWERT"  |

# create a dataframe from a list 
df = pd.DataFrame(['ABCD(!)', 'DEFG(23)', '(MNPQ. )', '32.JHGF', 'QWERT'], columns=['ColumnA'])

 #   |  ColumnA   |
#    ------------ 
#   |    ABCD    |
#   |    DEFG    |
#   |    MNPQ    |
#   |    JHGF    |
#   |    QWERT   |
# keep only the characters that are from A to Z, a-z 
df['ColumnB'] =df['ColumnA'].str.replace(r'[^a-zA-Z] ', '')

print(df['ColumnB'])

Result:

0     ABCD
1     DEFG
2     MNPQ
3     JHGF
4    QWERT

CodePudding user response：

Your suggested code works fine on my installation with only extra digits so that you need to update your regex statement: r"[^a-zA-Z] " If this doesn't work, then maybe try to update your pandas;

import pandas as pd

d = {'Column A': [' ABCD(!)', '<DEFG>(23)', '(MNPQ. )', ' 32.JHGF',  '"QWERT"']}
df = pd.DataFrame(d)
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z] ", "", regex=True)
print(df)

Output

  Column A
0     ABCD
1     DEFG
2     MNPQ
3     JHGF
4    QWERT