I have a dataframe with a column containing list of strings.
id sentence category
0 "I love basketball and dunk to the basket" ['basketball']
1 "I am playing football and basketball tomorrow " ['football', 'basketball']
I would like to do 2 things:
-
- Transform category column where every elements from previous list become a string and have one row for each string and with same id and sentence
-
- Have one dataframe by category
Expected output for step 1):
id sentence category
0 "I love basketball and dunk to the basket" 'basketball'
1 "I am playing football and tomorrow basketball" 'football'
1 "I am playing football and tomorrow basketball" 'basketball'
Expected output for step 2):
DF_1
id sentence category
0 "I love basketball and dunk to the basket" 'basketball'
1 "I am playing football and tomorrow basketball" 'basketball'
DF_2
id sentence category
1 "I am playing football and tomorrow basketball" 'football'
How can I do this ? For each and examine len of each list can work, but is there a more faster/elegant way ?
CodePudding user response:
You could explode
"category"; then groupby
:
out = [g for _, g in df.explode('category').groupby('category')]
Then if you print the items in out
:
for i in out:
print(i, end='\n\n')
you'll see:
id sentence category
0 0 I love basketball and dunk to the basket basketball
1 1 I am playing football and basketball tomorrow basketball
id sentence category
1 1 I am playing football and basketball tomorrow football
CodePudding user response:
You'll need two tools : explode and groupby.
First let's prepare our data, and ensure explode will work with literal_eval :
import pandas as pd
from io import StringIO
from ast import literal_eval
csvfile = StringIO(
"""id\tsentence\tcategory
0\t"I love basketball and dunk to the basket"\t["basketball"]
1\t"I am playing football and basketball tomorrow "\t["football", "basketball"]""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df.loc[:, 'category'] = df.loc[:, 'category'].apply(literal_eval)
Then explode regarding your category columns :
df = df.explode('category')
Finally, you can use groupby
as a dictionary and store your sub dataframes elsewhere :
dg = df.groupby('category')
list_dg = []
for n, g in dg:
list_dg.append(g)
Imo, I will stick with dg
if possible