I am trying to delete an icon that appears in many rows of my CSV file. When I create a DataFrame using pd.read_csv, it shows a green squared check icon, but if I open the CSV in Excel I see ✅ instead. I tried to remove it with the split function, because the verification status is separated from the comment by |:
df['reviews'] = df['reviews'].apply(lambda x: x.split('|')[1])
I noticed that it doesn't detect the "|" separator when the review contains the icon mentioned above.
I am not sure whether it is an encoding problem. I tried adding encoding='utf-8' to pandas read_csv, but it didn't solve the problem.
Thanks in advance.
I would like to add that this is a picture of the CSV file when I open it in Excel.
CodePudding user response:
You can remove characters outside Latin-1 using the encode/decode methods:
>>> df
reviews
0 ✓ Trip Verified
1 Verified
>>> df['reviews'].str.encode('latin1', errors='ignore').str.decode('latin1')
0 Trip Verified
1 Verified
Name: reviews, dtype: object
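One caveat worth checking before relying on this: the encode/decode round trip drops every character outside Latin-1, not only the check mark, and it leaves the space that followed the icon. A minimal sketch, assuming made-up sample reviews (the comment text after "|" is hypothetical):

```python
import pandas as pd

# Hypothetical sample data: a check mark in row 0, a non-Latin-1 letter in row 1
df = pd.DataFrame({
    "reviews": [
        "\u2713 Trip Verified | Great flight",
        "Not Verified | \u017bywiec was late",
    ]
})

cleaned = (
    df["reviews"]
    .str.encode("latin1", errors="ignore")  # silently drops anything outside Latin-1
    .str.decode("latin1")
    .str.strip()                            # remove the space left behind by the icon
)
print(cleaned.tolist())
```

In this sketch the "\u017b" in the second review disappears along with the check mark, so this approach is probably best suited to text that is otherwise plain Latin-1.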
CodePudding user response:
Say you had the following dataframe:
reviews
0 ✅ Trip Verified
1 Not Verified
2 Not Verified
3 ✅ Trip Verified
You can use the replace method to remove the ✅ symbol, which is Unicode character U+2705.
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
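If you are unsure which code point the icon in your own file actually is, one way to check is to print the code points at the start of a cell. This is only a sketch; `sample` is a hypothetical value standing in for one cell of your column:

```python
import unicodedata

# Hypothetical cell value; in practice take one from the data, e.g. df['reviews'].iloc[0]
sample = "\u2705 Trip Verified | Good seats"

# Collect (code point, official Unicode name) for the first two characters
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in sample[:2]]
for code_point, name in names:
    print(code_point, name)
```

If the first entry is not U+2705, substitute whatever code point is reported in the replace call.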
Here is the full example:
Code:
import pandas as pd
df = pd.DataFrame({"reviews":['\u2705 Trip Verified', 'Not Verified', 'Not Verified', '\u2705 Trip Verified']})
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
print(df)
Output:
reviews
0 Trip Verified
1 Not Verified
2 Not Verified
3 Trip Verified
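To tie this back to the split on "|" from the question, here is a minimal sketch using the vectorized string methods instead of apply. The sample comments and the column names `status` and `comment` are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample data in the "status | comment" shape described in the question
df = pd.DataFrame({
    "reviews": [
        "\u2705 Trip Verified | Good seats",
        "Not Verified | Late departure",
    ]
})

# Remove the icon as a literal string (regex=False), then split once on "|"
no_icon = df["reviews"].str.replace("\u2705", "", regex=False)
parts = no_icon.str.split("|", n=1, expand=True)

# Strip the whitespace left around each piece
df["status"] = parts[0].str.strip()
df["comment"] = parts[1].str.strip()
print(df[["status", "comment"]])
```

Splitting with n=1 keeps any further "|" characters inside the comment, and rows without a separator get a missing value in `comment` rather than raising an error.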