I have a pandas dataframe column like below :
| ColumnA |
-------------
| ABCD(!) |
| <DEFG>(23) |
| (MNPQ. ) |
| 32.JHGF |
| "QWERT" |
Aim is to remove the special characters and produce the output as below :
| ColumnA |
------------
| ABCD |
| DEFG |
| MNPQ |
| JHGF |
| QWERT |
Tried using the replace
method like below, but without success :
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z\d\_] ", "", regex=True)
print(df)
So, how can I replace the special characters using replace method in pandas?
CodePudding user response:
Your solution is also for get numbers \d
and _
, so it remove only:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z] ", "")
print (df)
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
CodePudding user response:
regrex should be r'[^a-zA-Z] '
, it means keep only the characters that are from A to Z, a-z
import pandas as pd
# | ColumnA |
# -------------
# | ABCD(!) |
# | <DEFG>(23) |
# | (MNPQ. ) |
# | 32.JHGF |
# | "QWERT" |
# create a dataframe from a list
df = pd.DataFrame(['ABCD(!)', 'DEFG(23)', '(MNPQ. )', '32.JHGF', 'QWERT'], columns=['ColumnA'])
# | ColumnA |
# ------------
# | ABCD |
# | DEFG |
# | MNPQ |
# | JHGF |
# | QWERT |
# keep only the characters that are from A to Z, a-z
df['ColumnB'] =df['ColumnA'].str.replace(r'[^a-zA-Z] ', '')
print(df['ColumnB'])
Result:
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
CodePudding user response:
Your suggested code works fine on my installation with only extra digits so that you need to update your regex statement: r"[^a-zA-Z] " If this doesn't work, then maybe try to update your pandas;
import pandas as pd
d = {'Column A': [' ABCD(!)', '<DEFG>(23)', '(MNPQ. )', ' 32.JHGF', '"QWERT"']}
df = pd.DataFrame(d)
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z] ", "", regex=True)
print(df)
Output
Column A
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT