I want to create a unique dataset of fruits. I don't know all the types (e.g. colour store, price) that could be under each fruit. For each type, there could also be duplicate rows. Is there a way to detect all possible duplicates and capture all unique informoation in a fully generalisable way?
type val detail
0 fruit apple
1 colour green greenish
2 colour yellow
3 store walmart usa
4 price 10
5 NaN
6 fruit banana
7 colour yellow
8 fruit pear
9 fruit jackfruit
...
Expected Output
fruit colour store price detail ...
0 apple [green, yellow ] [walmart] [10] [greenish, usa]
1 banana [yellow] NaN NaN
2 pear NaN NaN NaN
3 jackfruit NaN NaN NaN
I tried. But this does not get close to the expected output. It does not show the colum names either.
df.groupby("type")["val"].agg(size=len, set=lambda x: set(x))
0 fruit {"apple",...}
1 colour ...
CodePudding user response:
First is created fruit
column with val
values if type is fruit
, replace non matched values to NaN
s and forward filling missing values, then pivoting by DataFrame.pivot_table
with custom function for unique values without NaN
s and then flatten MultiIndex
:
m = df['type'].eq('fruit')
df['fruit'] = df['val'].where(m).ffill()
df1 = (df.pivot_table(index='fruit',columns='type',
aggfunc=lambda x: list(dict.fromkeys(x.dropna())))
.drop('fruit', axis=1, level=1))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df1)
detail_colour detail_price detail_store val_colour val_price \
fruit
apple [greenish] [] [usa] [green, yellow] [10]
banana [] NaN NaN [yellow] NaN
jackfruit NaN NaN NaN NaN NaN
pear NaN NaN NaN NaN NaN
val_store
fruit
apple [walmart]
banana NaN
jackfruit NaN
pear NaN