As part of feature engineering, I want to use the counts of a column after groupby as a feature of the model, This is what I have tried
>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
col1 col2 col3
0 a val1 val3
1 b val2 val4
2 a val2 val3
3 c val1 val4
4 a val2 val3
5 b val2 val4
>>> test = df.groupby('col1').agg(list)
col2 col3
col1
a [val1, val2, val2] [val3, val3, val3]
b [val2, val2] [val4, val4]
c [val1] [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
col2 col3
col1
a {'val1': 1, 'val2': 2} {'val3': 3}
b {'val2': 2} {'val4': 2}
c {'val1': 1} {'val4': 1}
Later I can expand the dicts into separate columns so the final output would be:
>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
val1 val2 val3 val4
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
I feel like there is a simpler solution, any help is appreciated.
CodePudding user response:
df2 = df.melt(id_vars='col1', value_name='count')
pd.crosstab(df2['col1'], df2['count'])
output:
count val1 val2 val3 val4
col1
a 1 2 3 0
b 0 2 0 2
c 1 0 0 1
If you want NaN
:
df3 = pd.crosstab(df2['col1'], df2['count'])
df3.mask(df3.eq(0))
output:
count val1 val2 val3 val4
col1
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
CodePudding user response:
df = pd.concat([df[['col1','col2']], df[['col1','col3']].rename(columns={"col3": "col2"})])
df = df.pivot_table(index = 'col1', columns = 'col2',aggfunc=len)
print(df)
output:
col2 val1 val2 val3 val4
col1
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0
CodePudding user response:
Another option that combines melt, groupby, and unstack:
(df.melt('col1')
.groupby(['col1', 'value'])
.size()
.unstack()
.rename_axis(index=None, columns=None)
)
val1 val2 val3 val4
a 1.0 2.0 3.0 NaN
b NaN 2.0 NaN 2.0
c 1.0 NaN NaN 1.0