Normalize strings in a dataframe or json list - python


EDIT: I need to access the ds data and normalize the text. I tried converting the JSON to a DataFrame so I could try some functions I found, but with no success. In the end it doesn't matter whether I have a DataFrame or JSON... The main issue is being able to normalize the text.

Starting from the following data:

data = [
        {
            "id": 504,
            "ds": "A description with ressonância magnética"
        },
        {   
            "id": 505,
            "ds": "Another description that contains word with accentuation"
        }]

I'm converting it to a pandas DataFrame:

    df = pd.DataFrame(data)
    print(df['ds'])

And I try to access df['ds'] to use .apply(unicodedata.normalize('NFKD', df['ds'])), because I need to remove any type of accentuation, e.g. 'à', 'â', 'ã', etc., from the words.

But I get 'AttributeError: 'str' object has no attribute 'apply''

Another thing I tried was:

        def remove_accents(input_str):
            nfkd_form = unicodedata.normalize('NFKD', input_str)
            only_ascii = nfkd_form.encode('ASCII', 'ignore')
            return only_ascii
        
        df['ds'] = df['ds'].apply(remove_accents)

But I get the error 'TypeError: normalize() argument 2 must be str, not bytes'

I'm new to Python, so forgive me, lol. But I've tried many things.

Any help is appreciated!

CodePudding user response:

To convert the DataFrame back to JSON: df.to_json("data.json", orient="records")
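For context, a minimal round-trip sketch (the file name data.json is just an example, and force_ascii=False is only needed if you want any remaining non-ASCII characters written out as-is):

    import pandas as pd

    df = pd.DataFrame(data)                 # list of dicts -> DataFrame
    df.to_json("data.json", orient="records", force_ascii=False)

    # and back again, if needed
    df2 = pd.read_json("data.json", orient="records")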

I have the following function:

def remove_accents(input_str):
    import unicodedata
    # decompose accented characters into base character + combining accent
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    # drop the combining accents (and any other non-ASCII characters)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    # encode() returns bytes, so decode back to a regular str
    return only_ascii.decode('utf-8')

df['ds'] = df['ds'].apply(remove_accents)

With the following df:

data = [
        {
            "id": 504,
            "ds": "A description of the product"
        },
        {
            "id": 505,
            "ds": "Another description that contâins word with ãccentuation"
        }]

And my output is:

    id                                                        ds
0  504                              A description of the product
1  505  Another description that contains word with accentuation

Running the function without .decode('utf-8') will give you:

    id                                                           ds
0  504                              b'A description of the product'
1  505  b'Another description that contains word with accentuation'
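
As a side note, if the column might already contain bytes or missing values (one way to hit the TypeError from the question, e.g. when the apply is run a second time on an already-encoded column), a more defensive variant could look like the sketch below; safe_remove_accents is just a hypothetical name:

    import unicodedata

    def safe_remove_accents(value):
        # hypothetical defensive helper: tolerate bytes and non-string values
        if isinstance(value, bytes):
            value = value.decode('utf-8')   # already encoded once, turn back into str
        if not isinstance(value, str):
            return value                    # leave NaN / None / numbers untouched
        nkfd_form = unicodedata.normalize('NFKD', value)
        return nkfd_form.encode('ASCII', 'ignore').decode('utf-8')

    df['ds'] = df['ds'].apply(safe_remove_accents)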

Hope this helps.

CodePudding user response:

If you want to get rid of non-ASCII characters, just encode to ascii, then decode again:

>>> df['ds'].str.encode('ascii', 'ignore').str.decode('ascii')

0                         A description of the product
1    Another description that contains word with ac...
Name: ds, dtype: object

Similarly, you can pass 'utf-8' instead if you want to keep the non-ASCII characters.
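
Keep in mind that if a character is precomposed (e.g. 'â' as a single code point), encoding to ASCII drops it entirely rather than converting it to 'a'; normalizing with NFKD first decomposes it so only the combining accent is discarded. A minimal sketch, assuming the data list from the question:

    import pandas as pd

    df = pd.DataFrame(data)

    # decompose accents, drop the combining marks, decode back to str
    df['ds'] = (df['ds']
                .str.normalize('NFKD')
                .str.encode('ascii', 'ignore')
                .str.decode('ascii'))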

CodePudding user response:

Define data with accentuation (also using German/French accents):

data = [
      {"id": 504,
       "ds": "A description of the product"
      },
      {"id": 505,
       "ds": "Another dèscription thàt contains word with accentuation"
      }]

df = pd.DataFrame(data)
print(df['ds'])

Pandas DataFrame .apply

df['ds'].apply(lambda x: unicodedata.normalize('NFKD', x))

Pandas Series string methods

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.normalize.html


At this point your accented characters have been split into the base character plus the corresponding combining accent (è, à, é; German, French, ...).
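
A quick way to see the decomposition (just an illustrative snippet):

    import unicodedata

    s = unicodedata.normalize('NFKD', 'è')
    print(len(s))                                 # 2
    print([unicodedata.name(c) for c in s])
    # ['LATIN SMALL LETTER E', 'COMBINING GRAVE ACCENT']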

You can then use str.translate and str.maketrans on your own.
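
For example, a minimal sketch that builds a translation table deleting the combining diacritical marks (U+0300..U+036F) left over after NFKD normalization:

    import unicodedata

    # str.maketrans('', '', chars_to_delete) maps every listed character to None
    combining_marks = ''.join(chr(code) for code in range(0x0300, 0x0370))
    strip_accents = str.maketrans('', '', combining_marks)

    text = unicodedata.normalize('NFKD', "Another dèscription thàt contains word with accentuation")
    print(text.translate(strip_accents))
    # Another description that contains word with accentuation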
