I have been trying to clean a particular column in a dataset. I am using the .apply() function multiple times to strip out any symbol that could be in the string values of the column.
For each symbol, the call looks like this: .apply(lambda x: x.replace("<symbol>", ""))
Although my code works, it is quite long and not very clean. I would like to know if there is a shorter and/or better way of cleaning a column.
Here is my code:
df_reviews = pd.read_csv("reviews.csv")
df_reviews = df_reviews.rename(columns={"Unnamed: 0" : "index", "0" : "Name"})
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
df_reviews['name'] = df_reviews['name']\
    .apply(lambda x: x.replace("Review", ""))\
    .apply(lambda x: x.replace(":", ""))\
    .apply(lambda x: x.replace("'", ""))\
    .apply(lambda x: x.replace('"', ""))\
    .apply(lambda x: x.replace("#", ""))\
    .apply(lambda x: x.replace("{", ""))\
    .apply(lambda x: x.replace("}", ""))\
    .apply(lambda x: x.replace("_", ""))\
    .apply(lambda x: x.replace(":", ""))
df_reviews['name'] = df_reviews['name'].str.strip()
As you can see, the many .apply() calls make it difficult to see clearly what is getting removed from the "name" column.
Could someone help me?
Kind regards
CodePudding user response:
You can also use regex:
df_reviews['name'] = df_reviews['name'].str.replace('Review|[:\'"#{}_]', "", regex=True)
Regex pattern: 'Review|[:\'"#{}_]'
- Review : matches the literal word "Review"
- | : or
- [:\'"#{}_] : any one of the characters within the square brackets []
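For a quick sanity check of the pattern outside pandas, the same regex works with the standard re module (the sample names below are made up):

```python
import re

# Alternation: the word "Review" OR any single character in the class
pattern = r'Review|[:\'"#{}_]'

samples = ["Review: 'John'", "#Jane_Doe", '{Review} "Bob"']
cleaned = [re.sub(pattern, "", s).strip() for s in samples]
print(cleaned)  # ['John', 'JaneDoe', 'Bob']
```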
Note:
If you are looking to remove ALL punctuation, you can use this instead:
import string
df_reviews['name'] = df_reviews['name'].str.replace(f'Review|[{string.punctuation}]', "", regex=True)
Which will remove the following characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
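One caveat: dropping string.punctuation straight into a character class only works because the backslash in it happens to come right before the ]. A safer sketch escapes the characters first with re.escape (the sample name is made up):

```python
import re
import string

# re.escape neutralizes ], \ and ^ so the character class is always well-formed
pattern = f'Review|[{re.escape(string.punctuation)}]'

result = re.sub(pattern, '', "Review: 'Anna-Marie' (2021)!").strip()
print(result)  # AnnaMarie 2021
```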
CodePudding user response:
Try this one:
df['name'] = df['name'].str.replace(r'Review|[:\'"#_]', '', regex=True).str.strip()
Note that regex=True should be passed explicitly: in recent pandas versions, str.replace() no longer treats the pattern as a regular expression by default.
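If you would rather avoid regex entirely, the chain of .apply() calls from the question can also be collapsed into a single helper that loops over the symbols (a sketch; the symbol list mirrors the question):

```python
# Symbols to strip, mirroring the chained .replace() calls in the question
SYMBOLS = ["Review", ":", "'", '"', "#", "{", "}", "_"]

def clean(value: str) -> str:
    for sym in SYMBOLS:
        value = value.replace(sym, "")
    return value.strip()

print(clean("Review: 'John_Smith'"))  # JohnSmith
```

Applied to the column with df_reviews['name'].apply(clean), this keeps the list of removed symbols in one visible place.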