Splitting a string made out of a dataframe row-wise

I'm trying to tokenize the words within a dataframe that looks like this

After removing all the special characters, the dataframe became a string like this

or perhaps this row-wise organised version might be easier to read

So when I tokenize this string with the code below

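from nltk.tokenize import word_tokenize

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        # tokenize every element of the list recursively
        return [tokenize(i) for i in obj]
    else:
        return obj

# `text` is the string built from the dataframe
tokenized_text = tokenize(text)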

I get the output

which is quite different from the output I expected

Any ideas on how I can get the output I expected? Any help would be greatly appreciated!

CodePudding user response:

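Don't convert the DataFrame to a string. Work with every text in the DataFrame separately instead.

Use .applymap(function) to execute a function on every text (on every cell in the DataFrame). In pandas 2.1 and later, .applymap() is deprecated in favor of .map(), which is called the same way.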

new_df = df.applymap(tokenize)

result = new_df.values.tolist()
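
To see why this keeps the row structure while the single-string approach loses it, here is a small sketch that runs without pandas or NLTK. It is only an illustration: str.split() stands in for word_tokenize (so punctuation is not split off the way NLTK would do it), and `rows` is made-up sample data shaped like two DataFrame rows.

```python
# Stand-in tokenizer: str.split() is an assumption used here so the sketch
# runs without NLTK; swap in nltk's word_tokenize for real use.
def tokenize(obj):
    if isinstance(obj, str):
        return obj.split()
    return obj  # None and other non-string values pass through unchanged

# Hypothetical sample rows, mirroring two rows of a DataFrame.
rows = [
    ["Orange", "Robot", "X Eyes", None],
    ["Aqua", "Robot", "3d", "Sea Captain Hat"],
]

# Flattening everything into one string loses the row and cell boundaries.
flat_text = " ".join(cell for row in rows for cell in row if cell is not None)
flat_tokens = tokenize(flat_text)

# Tokenizing each cell separately keeps one list per cell, grouped by row.
per_cell_tokens = [[tokenize(cell) for cell in row] for row in rows]

print(flat_tokens)
print(per_cell_tokens)
```

The flat version gives one long list of words with no way to tell where a row or cell ended, while the per-cell version has the same nested shape that `new_df.values.tolist()` produces.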

Minimal working example:

import pandas as pd
from nltk.tokenize import word_tokenize

data = {
    'Background': ['Orange', 'Orange', 'Aqua'], 
    'Fur': ['Robot', 'Robot', 'Robot'], 
    'Eyes': ['X Eyes', 'Blue Beams', '3d'],
    'Mouth': ['Discomfort', 'Grin', 'Bored Cigarette'],
    'Clothes': ['Striped Tee', 'Vietman Jacket', None],
    'Hat': [None, None, "Sea Captain's Hat"],
}

df = pd.DataFrame(data)

print(df.to_string())  # `to_string()` to display full dataframe without `...`

# ----------------------------------------

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str): 
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

new_df = df.applymap(tokenize)

result = new_df.values.tolist()

print(result)

Result:

  Background    Fur        Eyes            Mouth         Clothes                Hat
0     Orange  Robot      X Eyes       Discomfort     Striped Tee               None
1     Orange  Robot  Blue Beams             Grin  Vietman Jacket               None
2       Aqua  Robot          3d  Bored Cigarette            None  Sea Captain's Hat

[
  [['Orange'], ['Robot'], ['X', 'Eyes'], ['Discomfort'], ['Striped', 'Tee'], None], 
  [['Orange'], ['Robot'], ['Blue', 'Beams'], ['Grin'], ['Vietman', 'Jacket'], None], 
  [['Aqua'], ['Robot'], ['3d'], ['Bored', 'Cigarette'], None, ['Sea', 'Captain', "'s", 'Hat']]
]