How transform list of strings in column and split dataframe by same string to have several?-CodePudding

I have a dataframe with a column containing list of strings.

id sentence                                            category
0  "I love basketball and dunk to the basket"          ['basketball']
1  "I am playing football and basketball tomorrow "    ['football', 'basketball']

I would like to do 2 things:

1. Transform category column where every elements from previous list become a string and have one row for each string and with same id and sentence
1. Have one dataframe by category

Expected output for step 1):

id sentence                                            category
0  "I love basketball and dunk to the basket"          'basketball'
1  "I am playing football and tomorrow basketball"     'football'
1  "I am playing football and tomorrow basketball"     'basketball'

Expected output for step 2):

DF_1

id sentence                                            category
0  "I love basketball and dunk to the basket"          'basketball'
1  "I am playing football and tomorrow basketball"     'basketball'

DF_2

id sentence                                            category
1  "I am playing football and tomorrow basketball"     'football'

How can I do this ? For each and examine len of each list can work, but is there a more faster/elegant way ?

CodePudding user response：

You could explode "category"; then groupby:

out = [g for _, g in df.explode('category').groupby('category')]

Then if you print the items in out:

for i in out:
    print(i, end='\n\n')

you'll see:

   id                                        sentence    category
0   0        I love basketball and dunk to the basket  basketball
1   1  I am playing football and basketball tomorrow   basketball

   id                                        sentence  category
1   1  I am playing football and basketball tomorrow   football

CodePudding user response：

You'll need two tools : explode and groupby.

First let's prepare our data, and ensure explode will work with literal_eval :

import pandas as pd
from io import StringIO
from ast import literal_eval

csvfile = StringIO(
"""id\tsentence\tcategory
0\t"I love basketball and dunk to the basket"\t["basketball"]
1\t"I am playing football and basketball tomorrow "\t["football", "basketball"]""")

df = pd.read_csv(csvfile, sep = '\t', engine='python')

df.loc[:, 'category'] = df.loc[:, 'category'].apply(literal_eval)

Then explode regarding your category columns :

df = df.explode('category')

Finally, you can use groupby as a dictionary and store your sub dataframes elsewhere :

dg = df.groupby('category')

list_dg = []

for n, g in dg:
    list_dg.append(g)

Imo, I will stick with dg if possible