I am working with some open data in Deepnote using the pandas library. Since the data is in Spanish, the DataFrame contains accented characters and letters such as 'ñ'.
After some searching I was able to solve part of the problem by passing the 'encoding' argument. The remaining problem is that when I publish the page, accented characters such as 'á é í ó ú ñ' show up as strange symbols. Is there a way to go through the columns that contain text and replace each of these characters with its unaccented equivalent?
datos = pd.read_csv("/work/avisos",delimiter = ';', encoding="ISO-8859-1")
CodePudding user response:
How about just replacing exactly the symbols that you know to be problematic?
mapping = {'á': 'a',
           'é': 'e',
           'í': 'i',
           'ó': 'o',
           'ú': 'u',
           'ñ': 'n'}
# replace() returns a new Series, so assign the result back to the column.
df['col_name'] = df['col_name'].replace(mapping, regex=True)
This took ~20 seconds for a 1M row DataFrame with ~700 characters per row.
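For reference, here is a minimal self-contained sketch of the mapping approach on a tiny DataFrame. The column name `aviso` and the sample values are made up for illustration, and the uppercase entries are an addition that the mapping above does not include.

```python
import pandas as pd

# Hypothetical sample data; the real DataFrame would come from pd.read_csv.
df = pd.DataFrame({"aviso": ["Señal de tráfico", "Árbol caído", "Iluminación"]})

# Same idea as the mapping above, extended with uppercase letters
# (an assumption about what the data may contain).
mapping = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ñ": "n",
           "Á": "A", "É": "E", "Í": "I", "Ó": "O", "Ú": "U", "Ñ": "N"}

# regex=True makes replace() substitute inside each string instead of
# requiring the whole cell to equal one of the keys.
df["aviso"] = df["aviso"].replace(mapping, regex=True)

print(df["aviso"].tolist())
```

Any character that is not listed in the mapping (for example 'ü') is left untouched, so the mapping has to cover everything that appears in the data.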
CodePudding user response:
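import unicodedata

def remove_accents(x):
    # NFD decomposes each accented character into a base letter plus a
    # combining mark; encoding to ASCII with 'ignore' then drops the marks.
    return (unicodedata.normalize('NFD', x)
            .encode('ascii', 'ignore')
            .decode('utf-8'))

# Columns with dtype 'object' are the ones that hold text.
# Note: this assumes every cell in those columns is a string (no NaN).
word_cols = df.dtypes[lambda x: x.eq('object')].index.tolist()
# On pandas 2.1+, DataFrame.map replaces the deprecated applymap.
df[word_cols] = df[word_cols].applymap(remove_accents)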
Adapted from: How to replace accented characters?
That said, you may only need to do:
    return unicodedata.normalize('NFD', x)
for the accents to appear as expected on the published page.
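To see what the two variants actually return, here is a small standard-library sketch. The sample word is made up, and `remove_accents` is repeated so that the snippet runs on its own. It only shows how the strings change; whether the published page then renders them correctly is not something this snippet can check.

```python
import unicodedata

def remove_accents(x):
    # NFD splits 'ó' into 'o' plus a combining accent; the ASCII encode drops it.
    return (unicodedata.normalize("NFD", x)
            .encode("ascii", "ignore")
            .decode("utf-8"))

word = "cami\u00f3n"  # hypothetical sample value: 'camión' with a precomposed 'ó'
decomposed = unicodedata.normalize("NFD", word)

print(remove_accents(word))        # the accent is removed entirely
print(len(word), len(decomposed))  # NFD keeps the accent as one extra character
print(unicodedata.normalize("NFC", decomposed) == word)  # NFC recomposes it
```

In other words, normalizing with NFD alone does not remove the accent; it only changes how the accented letter is represented.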