EDIT: I need to access the ds data and normalize the text. I tried converting the JSON to a DataFrame to try some functions I found, but with no success. In the end it doesn't matter whether I have a DataFrame, JSON... The main issue is to be able to normalize the text.
From the following data:
data = [
    {
        "id": 504,
        "ds": "A description with ressonância magnética"
    },
    {
        "id": 505,
        "ds": "Another description that contains word with accentuation"
    }
]
I'm changing it to a pandas DataFrame
df = pd.DataFrame(data)
print(df['ds'])
And I try to access df['ds'] to use .apply(unicodedata.normalize('NFKD', df['ds'])), because I need to remove any type of accentuation from the words, e.g. 'à', 'â', 'ã', etc.
But I get 'AttributeError: 'str' object has no attribute 'apply''
Another thing I tried was
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii
df['ds'] = df['ds'].apply(remove_accents)
But I get the error 'TypeError: normalize() argument 2 must be str, not bytes'
I'm new to Python, so forgive me lol. But I've tried many things.
Any help is appreciated!
CodePudding user response:
To convert the DataFrame back to JSON:
df.to_json("data.json", orient="records")
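As a quick sketch of that step (the file name and the shortened sample data are just placeholders on my part): orient="records" writes the frame out as a JSON array of {"id": ..., "ds": ...} objects, matching the shape of the original data, and leaving out the path returns the JSON string instead of writing a file.

import pandas as pd

# shortened sample record, same shape as the data in the question
data = [{"id": 504, "ds": "A description with ressonância magnética"}]
df = pd.DataFrame(data)

# write a JSON array of objects: [{"id": 504, "ds": "..."}]
df.to_json("data.json", orient="records")

# leaving out the path returns the JSON string instead of writing a file
json_str = df.to_json(orient="records")
print(json_str)  # non-ASCII is escaped by default (force_ascii=True)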
I have the following function:
def remove_accents(input_str):
    import unicodedata
    # decompose accented characters into base character + combining accent
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    # drop the combining accents (and any other non-ASCII character)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    # decode back to a regular str so the column holds text, not bytes
    return only_ascii.decode('utf-8')

df['ds'] = df['ds'].apply(remove_accents)
With the following df:
data = [
    {
        "id": 504,
        "ds": "A description of the product"
    },
    {
        "id": 505,
        "ds": "Another description that contâins word with ãccentuation"
    }
]
And my output is:
id ds
0 504 A description of the product
1 505 Another description that contains word with accentuation
Running the function without .decode('utf-8') will give you
id ds
0 504 b'A description of the product'
1 505 b'Another description that contains word with accentuation'
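A minimal way to see that bytes-versus-str difference on a single value (the sample word here is just an illustration):

import unicodedata

# NFKD splits 'â' into 'a' plus a combining circumflex; encoding to ASCII
# with 'ignore' then drops only the combining mark
raw = unicodedata.normalize('NFKD', 'contâins').encode('ASCII', 'ignore')
print(raw)                  # b'contains'  -> bytes, printed with the b'...' prefix
print(raw.decode('utf-8'))  # contains     -> back to a plain str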
Hope this helps.
CodePudding user response:
If you want to get rid of non-ASCII characters, just encode to ascii, then decode again:
>>> df['ds'].str.encode('ascii', 'ignore').str.decode('ascii')
0 A description of the product
1 Another description that contains word with ac...
Name: ds, dtype: object
Similarly, you can pass 'utf-8' if you want to keep the UTF-8 characters.
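If the goal is stripping accents rather than dropping whole characters, a sketch that combines this with the NFKD decomposition from the other answers (the sample DataFrame is my own assumption) could look like this:

import pandas as pd

df = pd.DataFrame({"ds": ["Another dèscription thàt contains word with accentuation"]})

# decompose first so 'è' becomes 'e' + combining accent, then let the
# ASCII round-trip drop only the combining accents
df['ds'] = (df['ds']
            .str.normalize('NFKD')
            .str.encode('ascii', 'ignore')
            .str.decode('ascii'))
print(df['ds'][0])  # Another description that contains word with accentuation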
CodePudding user response:
Define data with accentuation (also using German/French accents):
data = [
    {
        "id": 504,
        "ds": "A description of the product"
    },
    {
        "id": 505,
        "ds": "Another dèscription thàt contains word with accentuation"
    }
]
df = pd.DataFrame(data)
print(df['ds'])
Pandas Series .apply
df['ds'].apply(lambda x: unicodedata.normalize('NFKD', x))
Pandas Series string methods
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.normalize.html
At this point your accented characters are split into the base character and the combining accent (è, à, é; German, French, ...).
You can then use str.translate and str.maketrans on your own to drop the combining accents.
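A minimal sketch of that translate/maketrans idea, assuming the accents you need to drop all fall in the Unicode combining-diacritics block U+0300–U+036F (which covers the French, German and Portuguese examples above; the sample DataFrame is my own):

import pandas as pd

df = pd.DataFrame({"ds": ["Another dèscription thàt contains word with accentuation"]})

# translation table that deletes every combining diacritic in U+0300–U+036F
drop_accents = str.maketrans('', '', ''.join(chr(c) for c in range(0x0300, 0x0370)))

# decompose first, then delete only the combining marks; the base letters stay
df['ds'] = df['ds'].str.normalize('NFKD').str.translate(drop_accents)
print(df['ds'][0])  # Another description that contains word with accentuation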