I'm trying to tokenize the words in a DataFrame that looks like this:
After removing all the special characters, the DataFrame became a string like this:
or perhaps this row-wise version is easier to read:
When I tokenize this string with the code below:
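from nltk.tokenize import word_tokenize

def tokenize(obj):
    # Recursively tokenize strings, descending into lists; leave other values untouched
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

tokenized_text = tokenize(text)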
I get this output:
which is quite different from the output I expected:
Any ideas on how I can get the output I expected? Any help would be greatly appreciated!
CodePudding user response:
Don't convert the DataFrame to a string; instead, work with each text in the DataFrame separately.
Use .applymap(function) to execute a function on every text (that is, on every cell in the DataFrame).
new_df = df.applymap(tokenize)
result = new_df.values.tolist()
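To see why tokenizing cell by cell keeps the row and column structure while tokenizing one big string does not, here is a minimal sketch in plain Python. It makes two assumptions that are not in the original post: `simple_tokenize` is a hypothetical stand-in built on `str.split` (so the sketch runs without NLTK), and `rows` is made-up data shaped like the DataFrame's rows.

```python
# Hypothetical stand-in for word_tokenize: splits on whitespace only,
# so the sketch runs without NLTK installed.
def simple_tokenize(text):
    return text.split()

# Made-up rows mirroring the shape of a DataFrame (None marks a missing cell).
rows = [
    ["Orange", "X Eyes", None],
    ["Aqua", "3d", "Sea Captain Hat"],
]

# Approach 1: join everything into one string, then tokenize.
# Row and cell boundaries are lost, so the result is one flat list.
joined = " ".join(cell for row in rows for cell in row if cell is not None)
flat_tokens = simple_tokenize(joined)

# Approach 2: tokenize each cell on its own, which is what applymap does per cell.
# The nesting (rows -> cells -> tokens) is preserved, and None stays None.
per_cell = [
    [simple_tokenize(cell) if isinstance(cell, str) else cell for cell in row]
    for row in rows
]

print(flat_tokens)
print(per_cell)
```

The first print shows a single flat list of tokens, while the second shows one list per row with one token list per cell. Note that in pandas 2.1 and later `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, which applies a function element-wise in the same way.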
Minimal working example:
import pandas as pd
from nltk.tokenize import word_tokenize
data = {
    'Background': ['Orange', 'Orange', 'Aqua'],
    'Fur': ['Robot', 'Robot', 'Robot'],
    'Eyes': ['X Eyes', 'Blue Beams', '3d'],
    'Mouth': ['Discomfort', 'Grin', 'Bored Cigarette'],
    'Clothes': ['Striped Tee', 'Vietman Jacket', None],
    'Hat': [None, None, "Sea Captain's Hat"],
}
df = pd.DataFrame(data)
print(df.to_string()) # `to_string()` to display full dataframe without `...`
# ----------------------------------------
def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj
new_df = df.applymap(tokenize)
result = new_df.values.tolist()
print(result)
Result:
  Background    Fur        Eyes            Mouth         Clothes                Hat
0     Orange  Robot      X Eyes       Discomfort     Striped Tee               None
1     Orange  Robot  Blue Beams             Grin  Vietman Jacket               None
2       Aqua  Robot          3d  Bored Cigarette            None  Sea Captain's Hat
[
[['Orange'], ['Robot'], ['X', 'Eyes'], ['Discomfort'], ['Striped', 'Tee'], None],
[['Orange'], ['Robot'], ['Blue', 'Beams'], ['Grin'], ['Vietman', 'Jacket'], None],
[['Aqua'], ['Robot'], ['3d'], ['Bored', 'Cigarette'], None, ['Sea', 'Captain', "'s", 'Hat']]
]