I am trying to remove the em dash from a pandas DataFrame. I have tried a number of different suggestions, but none of them work.
This is what I have so far:
# -*- coding: utf-8 -*-
my_str = 'word—word'
# my_str.replace(b'\xe2\x80\x94'.decode('utf-8'), '--')
my_str.replace('—'.decode('utf8'), 'xx')
#my_str.replace('\u2014', 'x')
print(my_str)
I am trying to simply replace the em dash '—' with something else.
I always get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)
I am stuck on this. Any help is appreciated. Thanks.
===========
Solution thanks to @Corralien. I am adding a few more points to solve this problem, since it was driving me bonkers.
First, check which Unicode character you are dealing with, using
print(ord(u'—'))
Then look up that code point in a Unicode character table
and find the Python source code escape for it. In my case ord returned 8212 (0x2014 in hex), which in Python is written as
u"\u2014"
In Python 2.7 you need to prefix the string with 'u', because only from version 3 onwards does Python treat every string literal as Unicode by default. So for Python 3 use
"\u2014"
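As a quick check of the lookup described above, here is a minimal Python 3 sketch. The variable names are only for illustration, and the standard-library unicodedata module stands in for an online character table.

```python
import unicodedata

dash = "—"  # the character copied from the data

code_point = ord(dash)
print(code_point)              # 8212
print(hex(code_point))         # 0x2014
print(unicodedata.name(dash))  # EM DASH

# The \u escape denotes exactly the same one-character string.
assert dash == "\u2014"
```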
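Lastly, in Python 2 the UTF-8 byte sequence works as well, provided the column holds UTF-8 encoded byte strings. (In Python 3 this literal is three separate characters and will not match the em dash.)
.str.replace('\xe2\x80\x94', ' ')
This corresponds to the UTF-8 hex representation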
0xE2 0x80 0x94
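To see how the code point and the UTF-8 bytes relate, here is a small Python 3 sketch (variable names are illustrative). It also shows that in Python 3 the text literal '\xe2\x80\x94' is not the em dash, so the byte form is mainly useful on Python 2 byte strings or on bytes objects.

```python
em_dash = "\u2014"

# Encoding the single character yields the three UTF-8 bytes.
utf8_bytes = em_dash.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x94'

# Decoding those bytes gives the em dash back.
assert utf8_bytes.decode("utf-8") == em_dash

# As a Python 3 text literal, the same escapes are three separate characters.
text_literal = "\xe2\x80\x94"
assert len(text_literal) == 3
assert text_literal != em_dash

# So this replace leaves the string unchanged in Python 3.
unchanged = "word\u2014word".replace(text_literal, " ")
assert unchanged == "word\u2014word"
```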
I hope this helps future visitors; at least I will be able to look this up next time before wasting hours debugging.
PS: .str.replace is only available on a Series, i.e. a single column; a DataFrame has no .str accessor. So I used it like this:
df['column'] = df['column'].str.replace(u'\u2014', ' ')
This replaces the column in question with a copy in which the em dash is replaced by a space.
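If several columns need cleaning, one possible approach is DataFrame.replace with regex=True, which does substring replacement on string cells across the whole frame. This is a sketch on a made-up frame; the column names 'a' and 'b' are hypothetical and it assumes a reasonably recent pandas on Python 3.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({
    "a": ["word\u2014word", "plain"],
    "b": ["x\u2014y", "z"],
})

# Single column: the .str accessor lives on a Series.
col_a = df["a"].str.replace("\u2014", " ", regex=False)

# Whole frame: regex=True makes replace match inside strings, not only whole cells.
cleaned = df.replace("\u2014", " ", regex=True)

print(col_a.tolist())
print(cleaned)
```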
CodePudding user response:
Use the Unicode escape sequence for the em dash:
>>> my_str.replace('\u2014', '*')
'word*word'
From a dataframe:
>>> df['col'].str.replace('\u2014', '*')
0    word*word
Name: col, dtype: object