Splitting a string made out of a DataFrame row-wise

Time:09-20

I'm trying to tokenize the words in a DataFrame that looks like this: [image: dataframe]

After removing all the special characters, the DataFrame became a string like this: [image: string1]

Or perhaps this row-wise version is easier to read: [image: string 2]

When I tokenize this string with the code below:

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str): 
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

tokenized_text = tokenize(text)

I get this output: [image: tokenized_text]

which is quite different from the output I expected: [image: expected output]

Any ideas on how I can get the output I expected? Any help would be greatly appreciated!

CodePudding user response:

Don't convert the DataFrame to a string. Instead, work with each text in the DataFrame separately.
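To see why this matters, here is a rough pure-Python sketch. The sample rows and the `simple_tokenize` helper are made up for illustration: it only splits on whitespace, so it is a stand-in for `word_tokenize`, not a replacement. Joining everything into one string first throws away the row and cell boundaries, while tokenizing each cell keeps them:

```python
# Two rows of cells standing in for DataFrame rows (hypothetical sample data).
rows = [
    ["Orange", "Robot", "X Eyes", None],
    ["Aqua", "Robot", "3d", "Sea Captain's Hat"],
]


def simple_tokenize(text):
    """Stand-in for nltk's word_tokenize: split on whitespace only."""
    return text.split()


# Approach 1: join every cell into one string, then tokenize.
# The result is a single flat list; row and cell boundaries are gone.
flat_text = " ".join(cell for row in rows for cell in row if cell is not None)
flat_tokens = simple_tokenize(flat_text)

# Approach 2: tokenize each cell on its own.
# The nesting (rows -> cells -> tokens) is preserved, and None stays None.
per_cell_tokens = [
    [simple_tokenize(cell) if cell is not None else None for cell in row]
    for row in rows
]

print(flat_tokens)
print(per_cell_tokens)
```

With the first approach there is no way to tell afterwards that 'X' and 'Eyes' came from the same cell, which is likely why the output from the stringified DataFrame looked wrong.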

Use .applymap(function) to execute a function on every text (that is, on every cell in the DataFrame).

new_df = df.applymap(tokenize)

result = new_df.values.tolist()

Minimal working example:

import pandas as pd
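# word_tokenize needs NLTK's Punkt data; if it is missing, run nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk.tokenize import word_tokenize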

data = {
    'Background': ['Orange', 'Orange', 'Aqua'], 
    'Fur': ['Robot', 'Robot', 'Robot'], 
    'Eyes': ['X Eyes', 'Blue Beams', '3d'],
    'Mouth': ['Discomfort', 'Grin', 'Bored Cigarette'],
    'Clothes': ['Striped Tee', 'Vietman Jacket', None],
    'Hat': [None, None, "Sea Captain's Hat"],
}

df = pd.DataFrame(data)

print(df.to_string())  # `to_string()` to display full dataframe without `...`

# ----------------------------------------

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str): 
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

new_df = df.applymap(tokenize)

result = new_df.values.tolist()

print(result)

Result:

  Background    Fur        Eyes            Mouth         Clothes                Hat
0     Orange  Robot      X Eyes       Discomfort     Striped Tee               None
1     Orange  Robot  Blue Beams             Grin  Vietman Jacket               None
2       Aqua  Robot          3d  Bored Cigarette            None  Sea Captain's Hat

[
  [['Orange'], ['Robot'], ['X', 'Eyes'], ['Discomfort'], ['Striped', 'Tee'], None], 
  [['Orange'], ['Robot'], ['Blue', 'Beams'], ['Grin'], ['Vietman', 'Jacket'], None], 
  [['Aqua'], ['Robot'], ['3d'], ['Bored', 'Cigarette'], None, ['Sea', 'Captain', "'s", 'Hat']]
]
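A note for newer pandas: since pandas 2.1, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, so the code above may emit a FutureWarning. One possible way to support both is sketched below. The names `apply_to_cells` and `split_cell` are made up for this example, and `split_cell` is only a whitespace-splitting stand-in for `tokenize` so that the snippet runs without NLTK data:

```python
import pandas as pd


def apply_to_cells(df, func):
    """Apply func to every cell, using whichever method this pandas offers."""
    if hasattr(pd.DataFrame, "map"):   # pandas 2.1 and later
        return df.map(func)
    return df.applymap(func)           # older pandas


def split_cell(cell):
    # Stand-in for tokenize(): whitespace split for strings, None otherwise.
    return cell.split() if isinstance(cell, str) else None


# Small hypothetical DataFrame with one missing value.
df = pd.DataFrame({
    "Eyes": ["X Eyes", "Blue Beams"],
    "Hat": ["Sea Captain's Hat", None],
})

result = apply_to_cells(df, split_cell).values.tolist()
print(result)
```

In real code you would pass `tokenize` instead of `split_cell`; the rest stays the same.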