I have two columns:
category names
vegetables [broccoli, ginger]
fruit [apple, grapes, dragonfruit]
vegetables [pine]
vegetables [bottleguord, pumpkin]
fruit [mango, guava]
I need to find the unique values that each category contains. This is how you could create the DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['vegetables', 'fruit', 'vegetables', 'vegetables', 'fruit'],
                   'Names': ['[broccoli, ginger]', '[apple, grapes, dragonfruit]', '[pine]', '[bottleguord, pumpkin]', '[mango, guava]']})
This is how I was trying to do it:
g = df.groupby('category')['Names'].apply(lambda x: list(np.unique(x)))
Expected output:
index = ['Vegetables' ,'Fruits']
new_df = pd.DataFrame(index=index)
This is how I tweaked the code: print(df.assign(Names=df['Names'].str[1:-1].str.split(', ')).explode('Names').groupby('category')['Names'].apply(lambda x: len(set(x))))
category len_unique_val
Vegetables 5
Fruits 4
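A minimal self-contained sketch of the counting approach above, using `nunique` in place of `len(set(x))` (the variable name `counts` is mine, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({'category': ['vegetables', 'fruit', 'vegetables', 'vegetables', 'fruit'],
                   'Names': ['[broccoli, ginger]', '[apple, grapes, dragonfruit]',
                             '[pine]', '[bottleguord, pumpkin]', '[mango, guava]']})

# strip the surrounding brackets, split into real lists, explode to one
# name per row, then count the distinct names in each category
counts = (df.assign(Names=df['Names'].str[1:-1].str.split(', '))
            .explode('Names')
            .groupby('category')['Names']
            .nunique())
print(counts)
```

`nunique` already counts distinct values per group, so the `lambda x: len(set(x))` can be dropped.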
CodePudding user response:
You can explode, groupby and apply to transform each group into a Python set.
assuming lists as input:
(df.explode('Names')
   .groupby('category')
   ['Names']
   .apply(set)
)
output:
category
fruit         {'dragonfruit', 'guava', 'grapes', 'apple', 'mango'}
vegetables    {'ginger', 'pine', 'broccoli', 'bottleguord', 'pumpkin'}
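The lists-as-input variant presupposes the `Names` column holds actual Python lists rather than bracketed strings; a minimal sketch of that case (the name `df_lists` is mine, for illustration):

```python
import pandas as pd

# here Names already contains Python lists, so no string parsing is needed
df_lists = pd.DataFrame({'category': ['vegetables', 'fruit', 'vegetables', 'vegetables', 'fruit'],
                         'Names': [['broccoli', 'ginger'], ['apple', 'grapes', 'dragonfruit'],
                                   ['pine'], ['bottleguord', 'pumpkin'], ['mango', 'guava']]})

# explode to one name per row, then collect each group into a set
result = (df_lists.explode('Names')
                  .groupby('category')['Names']
                  .apply(set))
print(result)
```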
assuming strings as input:
(df.assign(Names=df['Names'].str[1:-1].str.split(', '))
.explode('Names')
.groupby('category')
['Names']
 .apply(lambda x: '[' + ', '.join(set(x)) + ']')
)
output:
category
fruit [dragonfruit, guava, grapes, apple, mango]
vegetables [ginger, pine, broccoli, bottleguord, pumpkin]
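Since set iteration order is not deterministic, the joined strings above can come out in any order between runs. A hedged variant that sorts the unique names for reproducible output (the variable name `uniques` is mine):

```python
import pandas as pd

df = pd.DataFrame({'category': ['vegetables', 'fruit', 'vegetables', 'vegetables', 'fruit'],
                   'Names': ['[broccoli, ginger]', '[apple, grapes, dragonfruit]',
                             '[pine]', '[bottleguord, pumpkin]', '[mango, guava]']})

# parse the bracketed strings, explode, then collect a sorted list of
# unique names per category for deterministic ordering
uniques = (df.assign(Names=df['Names'].str[1:-1].str.split(', '))
             .explode('Names')
             .groupby('category')['Names']
             .apply(lambda x: sorted(set(x))))
print(uniques)
```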