I want to remove most, but not all, symbols from the 'Review' column of my data. Some background on my code:
from pandas.core.frame import DataFrame
# convert to lower case
data['Review'] = data['Review'].str.lower()
# remove leading and trailing whitespace
data['Review'] = data['Review'].str.strip()
This is what I did based on what I read on the internet (I'm still at a beginner level in NLP, so don't be surprised to find more than one mistake; I just want to know what they are):
import string
sep = '|'
punctuation_chars = '"#$%&()* ,-./:;<=>?@[\\]^_`{}~'
mapping_table = str.maketrans(dict.fromkeys(punctuation_chars, ''))
= sep.join(df[df(data['Review']).tolist()]).translate(mapping_table).split(sep)
However, I get the following error:
AttributeError: 'DataFrame' object has no attribute 'tolist'
How could I solve it? I want to use .translate() because I read it's more efficient than other methods.
CodePudding user response:
The AttributeError is raised because DataFrame.tolist() doesn't exist. It looks like the code assumes that df(data['Review']) is a Series, but it is actually a DataFrame.
df = DataFrame(data['Review'])
# a DataFrame built from the 'Review' Series keeps 'Review' as its column name
translated_reviews = sep.join(df['Review'].tolist()).translate(mapping_table).split(sep)
It's unclear whether data is a DataFrame. If it is, pass the column directly to join() without calling tolist() or instantiating a new DataFrame.
translated_reviews = sep.join(data['Review']).translate(mapping_table).split(sep)
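As a rough end-to-end illustration of the line above, here is a minimal sketch. The sample reviews are made up, and the character set is a variant of the question's with the space removed so that words stay separated; both are assumptions for the demo, not the asker's actual data.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's `data` DataFrame.
data = pd.DataFrame(
    {"Review": ["great product!!", "bad... (really) bad", "ok, #1 choice"]}
)

sep = "|"  # assumed not to occur in any review
# Characters to delete; '!' is deliberately absent, so it is kept.
punctuation_chars = '"#$%&()*,-./:;<=>?@[\\]^_`{}~'
mapping_table = str.maketrans(dict.fromkeys(punctuation_chars, ""))

# Join every review into one string, translate once, then split back into rows.
translated_reviews = sep.join(data["Review"]).translate(mapping_table).split(sep)
data["Review"] = translated_reviews
```

Iterating a Series yields its values, which is why join() accepts the column directly.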
CodePudding user response:
The problem is in the expression df[df(data['Review']).tolist()], where you build a DataFrame from a column of data and then call tolist() on it. A DataFrame has no tolist() method. You can either use
df.values.tolist()
which converts the whole DataFrame df to a nested list of rows, or, if you just want to convert one column, use data['Review'].tolist()
So in your case, the final line of your code becomes:
data['Review'] = sep.join(data['Review'].tolist()).translate(mapping_table).split(sep)
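One caveat with assigning the result back to the column: the split only lines up with the rows if no review contains the separator itself. The sketch below uses made-up reviews (an assumption) to show the mismatch, and then applies the same table per row with Series.str.translate, which needs no separator at all.

```python
import pandas as pd

# Hypothetical data; the second review contains the separator character '|'.
data = pd.DataFrame({"Review": ["good (mostly)", "fast | cheap", "meh..."]})

punctuation_chars = '"#$%&()*,-./:;<=>?@[\\]^_`{}~'
mapping_table = str.maketrans(dict.fromkeys(punctuation_chars, ""))

# join/translate/split produces an extra piece because of the embedded '|'.
sep = "|"
pieces = sep.join(data["Review"].tolist()).translate(mapping_table).split(sep)
row_count_matches = len(pieces) == len(data)  # False: 4 pieces for 3 rows

# Series.str.translate applies the same table to each row independently,
# so the result always has one entry per row.
data["Review"] = data["Review"].str.translate(mapping_table)
```

Whether the per-row version is fast enough depends on the data size; the single-string approach does one translate() call but relies on choosing a separator that never appears in the text.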