Change characters like á é í ó ú ñ to their unaccented equivalents in a DataFrame

I am working with some open data in Deepnote using the pandas library. Since the data is in Spanish, the DataFrame contains accented characters and characters like 'ñ'.

By searching, I was able to solve part of the problem by passing 'encoding' when reading the file. The remaining problem is that when I publish the page, accented characters such as 'á é í ó ú ñ' appear as strange symbols. Is there a way to go through the columns that contain text and replace those characters with their unaccented equivalents?

datos = pd.read_csv("/work/avisos", delimiter=';', encoding="ISO-8859-1")

CodePudding user response：

How about just replacing exactly the symbols that you know to be problematic?

mapping = {'á': 'a',
           'é': 'e',
           'í': 'i',
           'ó': 'o',
           'ú': 'u',
           'ñ': 'n'}

# replace() returns a new Series, so assign the result back to the column
df['col_name'] = df['col_name'].replace(mapping, regex=True)

This took ~20 seconds for a 1M row DataFrame with ~700 characters per row.
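If the regex-based replace is too slow, a character translation table is another option. The following is a minimal sketch, not a benchmarked solution: the names ACCENT_TABLE and strip_accents are made up for illustration, and the table also covers 'ü' and the uppercase letters, which the mapping above does not include.

```python
# Hypothetical translation table: each character in the first string maps to
# the character at the same position in the second string.
ACCENT_TABLE = str.maketrans("áéíóúüñÁÉÍÓÚÜÑ", "aeiouunAEIOUUN")


def strip_accents(text):
    """Return text with the mapped characters replaced one-for-one."""
    return text.translate(ACCENT_TABLE)


print(strip_accents("El niño comió piña en Málaga"))
```

On a DataFrame column the same table can be passed to the string accessor, for example df['col_name'].str.translate(ACCENT_TABLE). Whether this is actually faster than the 20 seconds measured above was not tested here.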

CodePudding user response：

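import unicodedata

def remove_accents(x):
    # Leave non-string values (e.g. NaN) untouched instead of raising TypeError
    if not isinstance(x, str):
        return x
    # NFD splits each accented letter into a base letter plus combining marks;
    # encoding to ASCII with 'ignore' then drops the marks. Note that it also
    # drops any other non-ASCII character, such as '¿' and '¡'.
    return (unicodedata.normalize('NFD', x)
                       .encode('ascii', 'ignore')
                       .decode('ascii'))


# Columns with dtype object are the ones that hold text
word_cols = df.select_dtypes(include='object').columns
# From pandas 2.1 on, applymap is deprecated in favor of DataFrame.map
df[word_cols] = df[word_cols].applymap(remove_accents)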

Adapted from: How to replace accented characters?

That said, you may only need to normalize the text, without the ASCII encode/decode step, for the accents to appear as expected on the published page:

    return unicodedata.normalize('NFD', x)
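To see what normalization alone does, here is a small sketch using only the standard library. The sample word is made up for illustration. It shows that NFD keeps the accent but stores it as a separate combining code point, and that NFC puts the letter back together. Whether either form fixes the display on the published page is an assumption to check against the page's own encoding.

```python
import unicodedata

word = "a\u00f1o"  # 'año' with 'ñ' stored as the single code point U+00F1

# NFD: 'ñ' becomes 'n' followed by the combining tilde U+0303
decomposed = unicodedata.normalize("NFD", word)

# NFC: the pair is recomposed into the single code point again
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(word), len(decomposed), len(recomposed))
print([hex(ord(c)) for c in decomposed])
```

The decomposed string is one code point longer than the original, which is why the encode('ascii', 'ignore') step in remove_accents can drop the accent while keeping the base letter.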