I am trying to remove all
\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n\, \xe2\x80\xa6,\xe2\x80\x99t
type characters from the strings below, which are in a pandas DataFrame column. Although each value starts with b', it is a str, not a bytes object.
Text
_____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6
"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '
"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"
"b'Climate Change Poses a WidelllThreat to National Security "
"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
"b'berates climate change activist who confronted her in airport\xc2\xa0
The content above is a column in a pandas DataFrame.
I have tried
string.encode('ascii', errors='ignore')
and regex, but without luck. Any suggestions would be helpful.
CodePudding user response:
You could go through your strings and keep only the ASCII characters (str.isascii() requires Python 3.7 or later):
my_str = "b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6"
new_str = "".join(c for c in my_str if c.isascii())
print(new_str)
Note that .encode('ascii', errors='ignore')
doesn't modify the string it's applied to; it returns a new bytes object with the non-ASCII characters dropped. Assign the result, and decode it if you need a str again:
new_str = my_str.encode('ascii', errors='ignore').decode('ascii')
print(new_str)
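To apply the same idea to a whole column rather than a single string, pandas offers the vectorised str.encode and str.decode accessors. The following is only a minimal sketch: the DataFrame, the column name text, and the sample values are made up for illustration, and it assumes the column holds real non-ASCII characters.

```python
import pandas as pd

# Hypothetical sample data containing real non-ASCII characters
# (an accented letter and an ellipsis character).
df = pd.DataFrame({"text": ["caf\xe9 \u2026 ok", "plain"]})

# Encode to ASCII bytes, silently dropping anything that is not ASCII,
# then decode back so the column holds str values again.
df["clean"] = (
    df["text"]
    .str.encode("ascii", errors="ignore")
    .str.decode("ascii")
)

print(df["clean"].tolist())
```

This only helps when the values contain actual non-ASCII characters. If the column stores the escape sequences as literal text (a backslash followed by x and two hex digits), every character is already ASCII and neither isascii() nor encode() removes anything.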
CodePudding user response:
Try something like this, which removes every literal \xNN escape sequence from the column:
>>> df['text'].str.replace(r'\\x[0-9a-f]{2}', '', regex=True)
0 b'Hello! End Climate Silence is looking for v...
1 b'I doubt if climate emergency 8s real, I thin...
2 b'No, thankfully it doesnt. Cant see how cheap...
3 b'Climate Change Poses a WidelllThreat to Nati...
4 b""This doesn't feel like targeted propaganda ...
5 b'berates climate change activist who confront...
Name: text, dtype: object
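The replacement above leaves the b' prefix and the literal \n sequences in place. If those should go too, the same approach can be extended, roughly as sketched here. The two sample rows are copied from the question as raw strings so the backslashes stay literal; the extra patterns and the column name text are assumptions, not part of the original answer.

```python
import pandas as pd

# Raw strings keep the backslashes literal, as in the question's column.
df = pd.DataFrame({"text": [
    r"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6",
    r"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '",
]})

cleaned = (
    df["text"]
    .str.replace(r"\\x[0-9a-fA-F]{2}", "", regex=True)  # drop literal \xNN escapes
    .str.replace(r"\\n", " ", regex=True)               # literal \n becomes a space
    .str.replace(r"^b['\"]+|['\"]+$", "", regex=True)   # strip the b' prefix and any closing quote
    .str.replace(r"\s+", " ", regex=True)               # collapse runs of whitespace
    .str.strip()
)

print(cleaned.tolist())
```

Because every \xNN sequence is removed, the curly apostrophe encoded as \xe2\x80\x99 disappears as well, which is why doesn\xe2\x80\x99t comes out as doesnt in the output above.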