I have been trying to clean a particular column in a dataset. I am using the .apply() function multiple times to strip out any symbol that could be in the string values of the column.
For each symbol, the call looks like this: .apply(lambda x: x.replace("<symbol>", ""))
Although my code works, it is quite long and not very clean. I would like to know if there is a shorter and/or better way of cleaning a column.
Here is my code:
df_reviews = pd.read_csv("reviews.csv")
df_reviews = df_reviews.rename(columns={"Unnamed: 0" : "index", "0" : "Name"})
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
df_reviews['name'] = df_reviews['name']\
    .apply(lambda x: x.replace("Review", ""))\
    .apply(lambda x: x.replace(":", ""))\
    .apply(lambda x: x.replace("'", ""))\
    .apply(lambda x: x.replace('"', ""))\
    .apply(lambda x: x.replace("#", ""))\
    .apply(lambda x: x.replace("{", ""))\
    .apply(lambda x: x.replace("}", ""))\
    .apply(lambda x: x.replace("_", ""))\
    .apply(lambda x: x.replace(":", ""))
df_reviews['name'] = df_reviews['name'].str.strip()
As you can see, the many .apply() calls make it difficult to see clearly what is getting removed from the "name" column.
Could someone help me?
Kind regards
CodePudding user response:
You can also use regex:
df_reviews['name'] = df_reviews['name'].str.replace('Review|[:\'"#{}_]', "", regex=True)
Regex pattern: 'Review|[:\'"#{}_]'
- Review : matches the literal word "Review"
- | : or
- [:\'"#{}_] : any one of the characters within the square brackets []
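For a quick sanity check of the pattern outside pandas, the same regex works with the standard re module (the sample names below are made up):

```python
import re

# Alternation: the word "Review" OR any single character in the class
pattern = r'Review|[:\'"#{}_]'

samples = ["Review: 'John'", "#Jane_Doe", '{Review} "Bob"']
cleaned = [re.sub(pattern, "", s).strip() for s in samples]
print(cleaned)  # ['John', 'JaneDoe', 'Bob']
```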
Note:
If you are looking to remove ALL punctuation, you can use this instead:
import string
df_reviews['name'] = df_reviews['name'].str.replace(f'Review|[{string.punctuation}]', "", regex=True)
Which will remove the following characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
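One caveat: dropping string.punctuation straight into a character class only works because the backslash in it happens to come right before the ]. A safer sketch escapes the characters first with re.escape (the sample name is made up):

```python
import re
import string

# re.escape neutralizes ], \ and ^ so the character class is always well-formed
pattern = f'Review|[{re.escape(string.punctuation)}]'

result = re.sub(pattern, '', "Review: 'Anna-Marie' (2021)!").strip()
print(result)  # AnnaMarie 2021
```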
CodePudding user response:
Try this one:
df['name'] = df['name'].str.replace(r'Review|[:\'"#_]', '', regex=True).str.strip()
Note that regex=True should be passed explicitly: in recent pandas versions, str.replace() no longer treats the pattern as a regular expression by default.
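If you would rather avoid regex entirely, the chain of .apply() calls from the question can also be collapsed into a single helper that loops over the symbols (a sketch; the symbol list mirrors the question):

```python
# Symbols to strip, mirroring the chained .replace() calls in the question
SYMBOLS = ["Review", ":", "'", '"', "#", "{", "}", "_"]

def clean(value: str) -> str:
    for sym in SYMBOLS:
        value = value.replace(sym, "")
    return value.strip()

print(clean("Review: 'John_Smith'"))  # JohnSmith
```

Applied to the column with df_reviews['name'].apply(clean), this keeps the list of removed symbols in one visible place.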