How to transform a list of strings in a column and split the dataframe by string to get several dataframes?

Time:03-25

I have a dataframe with a column containing lists of strings.

id sentence                                            category
0  "I love basketball and dunk to the basket"          ['basketball']
1  "I am playing football and basketball tomorrow "    ['football', 'basketball']

I would like to do 2 things:

    1. Transform the category column so that each element of the list becomes a string on its own row, keeping the same id and sentence
    2. Have one dataframe per category

Expected output for step 1:

id sentence                                            category
0  "I love basketball and dunk to the basket"          'basketball'
1  "I am playing football and basketball tomorrow "    'football'
1  "I am playing football and basketball tomorrow "    'basketball'

Expected output for step 2:

DF_1

id sentence                                            category
0  "I love basketball and dunk to the basket"          'basketball'
1  "I am playing football and basketball tomorrow "    'basketball'

DF_2

id sentence                                            category
1  "I am playing football and basketball tomorrow "    'football'

How can I do this? A for loop that examines the length of each list would work, but is there a faster or more elegant way?

CodePudding user response:

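You could explode the "category" column, which gives one row per list element (step 1), then group by it to get one dataframe per category (step 2):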

out = [g for _, g in df.explode('category').groupby('category')]

Then if you print the items in out:

for i in out:
    print(i, end='\n\n')

you'll see:

   id                                        sentence    category
0   0        I love basketball and dunk to the basket  basketball
1   1  I am playing football and basketball tomorrow   basketball

   id                                        sentence  category
1   1  I am playing football and basketball tomorrow   football
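If you would rather look up each sub-dataframe by category name than by its position in a list, the same explode + groupby idea can build a dictionary instead. This is only a minimal sketch: the dataframe is reconstructed from the question, and the names exploded and by_category are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question (category holds real lists)
df = pd.DataFrame({
    "id": [0, 1],
    "sentence": [
        "I love basketball and dunk to the basket",
        "I am playing football and basketball tomorrow ",
    ],
    "category": [["basketball"], ["football", "basketball"]],
})

# Step 1: one row per list element, id and sentence are repeated
exploded = df.explode("category")

# Step 2: map each category name to its own sub-dataframe
by_category = {name: group for name, group in exploded.groupby("category")}

print(by_category["football"])
```

by_category["basketball"] then plays the role of DF_1 and by_category["football"] the role of DF_2.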

CodePudding user response:

You'll need two tools: explode and groupby.

First, let's prepare the data and use literal_eval to turn the category strings into real lists, so that explode will work:

import pandas as pd
from io import StringIO
from ast import literal_eval

csvfile = StringIO(
"""id\tsentence\tcategory
0\t"I love basketball and dunk to the basket"\t["basketball"]
1\t"I am playing football and basketball tomorrow "\t["football", "basketball"]""")

df = pd.read_csv(csvfile, sep = '\t', engine='python')

# Parse each string such as '["football", "basketball"]' into a real Python list
df['category'] = df['category'].apply(literal_eval)
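The reason for this step: after read_csv, each category value is a plain string that only looks like a list, and explode leaves plain strings untouched. A small sketch of what literal_eval does to one such value (the variable names raw and parsed are illustrative):

```python
from ast import literal_eval

raw = '["football", "basketball"]'  # what read_csv yields: a str
parsed = literal_eval(raw)          # safely evaluated into a Python list

print(type(raw).__name__, type(parsed).__name__)
print(parsed)
```

If your dataframe already holds real lists, as in the question, you can skip this conversion.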

Then explode on the category column:

df = df.explode('category')

Finally, the groupby object behaves much like a dictionary keyed by category; you can iterate over it and store the sub-dataframes elsewhere:

dg = df.groupby('category')

list_dg = []

for n, g in dg:
    list_dg.append(g)

IMO, I would stick with dg if possible rather than copying the groups into a list.
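As a rough illustration of working with the groupby object directly, get_group pulls out one category on demand. This is a sketch under the assumption that the data looks like the question's; the variable football is just an illustrative name.

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "id": [0, 1],
    "sentence": [
        "I love basketball and dunk to the basket",
        "I am playing football and basketball tomorrow ",
    ],
    "category": [["basketball"], ["football", "basketball"]],
})

dg = df.explode("category").groupby("category")

# The available category names
print(list(dg.groups))

# Fetch a single sub-dataframe only when it is needed
football = dg.get_group("football")
print(football)
```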
