Normalize strings in a dataframe or json list - python


EDIT: I need to access the ds data and normalize the text. I tried converting the JSON to a DataFrame so I could try some functions I found, but with no success. In the end it doesn't matter whether I have a DataFrame or JSON... The main issue is being able to normalize the text.

Starting from the following data:

data = [
        {
            "id": 504,
            "ds": "A description with ressonância magnética"
        },
        {   
            "id": 505,
            "ds": "Another description that contains word with accentuation"
        }]

I'm converting it to a pandas DataFrame:

    df = pd.DataFrame(data)
    print(df['ds'])

And I try to access df['ds'] to use .apply(unicodedata.normalize('NFKD', df['ds'])), because I need to remove any type of accentuation, e.g. 'à', 'â', 'ã', etc., from the words.

But I get 'AttributeError: 'str' object has no attribute 'apply''

Another thing I tried was:

        def remove_accents(input_str):
            nfkd_form = unicodedata.normalize('NFKD', input_str)
            only_ascii = nfkd_form.encode('ASCII', 'ignore')
            return only_ascii
        
        df['ds'] = df['ds'].apply(remove_accents)

But I get the error 'TypeError: normalize() argument 2 must be str, not bytes'

I'm new to Python, so forgive me, lol. But I've tried many things.

Any help is appreciated!

CodePudding user response:

To convert the DataFrame back to JSON: df.to_json("data.json", orient="records")
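For context, a minimal round-trip sketch (the file name data.json is just an example, and force_ascii=False is only needed if you want any remaining non-ASCII characters written out as-is):

    import pandas as pd

    df = pd.DataFrame(data)                 # list of dicts -> DataFrame
    df.to_json("data.json", orient="records", force_ascii=False)

    # and back again, if needed
    df2 = pd.read_json("data.json", orient="records")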

I have the following function:

def remove_accents(input_str):
    import unicodedata
    # decompose accented characters into base character + combining accent
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    # drop the combining accents (and any other non-ASCII characters)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    # encode() returns bytes, so decode back to a regular str
    return only_ascii.decode('utf-8')

df['ds'] = df['ds'].apply(remove_accents)

With the following df:

data = [
        {
            "id": 504,
            "ds": "A description of the product"
        },
        {
            "id": 505,
            "ds": "Another description that contâins word with ãccentuation"
        }]

And my output is:

    id                                                        ds
0  504                              A description of the product
1  505  Another description that contains word with accentuation

Running the function without .decode('utf-8') will give you:

    id                                                           ds
0  504                              b'A description of the product'
1  505  b'Another description that contains word with accentuation'
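
As a side note, if the column might already contain bytes or missing values (one way to hit the TypeError from the question, e.g. when the apply is run a second time on an already-encoded column), a more defensive variant could look like the sketch below; safe_remove_accents is just a hypothetical name:

    import unicodedata

    def safe_remove_accents(value):
        # hypothetical defensive helper: tolerate bytes and non-string values
        if isinstance(value, bytes):
            value = value.decode('utf-8')   # already encoded once, turn back into str
        if not isinstance(value, str):
            return value                    # leave NaN / None / numbers untouched
        nkfd_form = unicodedata.normalize('NFKD', value)
        return nkfd_form.encode('ASCII', 'ignore').decode('utf-8')

    df['ds'] = df['ds'].apply(safe_remove_accents)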

Hope this helps.

CodePudding user response:

If you want to get rid of non-ASCII characters, just encode to ascii, then decode again:

>>> df['ds'].str.encode('ascii', 'ignore').str.decode('ascii')

0                         A description of the product
1    Another description that contains word with ac...
Name: ds, dtype: object

Similarly, you can pass 'utf-8' instead if you want to keep the non-ASCII characters.
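
Keep in mind that if a character is precomposed (e.g. 'â' as a single code point), encoding to ASCII drops it entirely rather than converting it to 'a'; normalizing with NFKD first decomposes it so only the combining accent is discarded. A minimal sketch, assuming the data list from the question:

    import pandas as pd

    df = pd.DataFrame(data)

    # decompose accents, drop the combining marks, decode back to str
    df['ds'] = (df['ds']
                .str.normalize('NFKD')
                .str.encode('ascii', 'ignore')
                .str.decode('ascii'))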

CodePudding user response:

Define data with accentuation (also using German/French accents):

data = [
      {"id": 504,
       "ds": "A description of the product"
      },
      {"id": 505,
       "ds": "Another dèscription thàt contains word with accentuation"
      }]

df = pd.DataFrame(data)
print(df['ds'])

Pandas DataFrame .apply

df['ds'].apply(lambda x: unicodedata.normalize('NFKD', x))

Pandas Series string methods

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.normalize.html


At this point your accented characters have been split into the base character plus the corresponding combining accent (è, à, é; German, French, ...).
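
A quick way to see the decomposition (just an illustrative snippet):

    import unicodedata

    s = unicodedata.normalize('NFKD', 'è')
    print(len(s))                                 # 2
    print([unicodedata.name(c) for c in s])
    # ['LATIN SMALL LETTER E', 'COMBINING GRAVE ACCENT']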

You can then use str.translate and str.maketrans on your own.
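
For example, a minimal sketch that builds a translation table deleting the combining diacritical marks (U+0300..U+036F) left over after NFKD normalization:

    import unicodedata

    # str.maketrans('', '', chars_to_delete) maps every listed character to None
    combining_marks = ''.join(chr(code) for code in range(0x0300, 0x0370))
    strip_accents = str.maketrans('', '', combining_marks)

    text = unicodedata.normalize('NFKD', "Another dèscription thàt contains word with accentuation")
    print(text.translate(strip_accents))
    # Another description that contains word with accentuation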
