Home > Mobile >  Pandas expand value counts after groupby as columns
Pandas expand value counts after groupby as columns

Time:11-19

As part of feature engineering, I want to use the counts of a column after groupby as a feature of the model, This is what I have tried

>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
   col1  col2  col3
0    a  val1  val3
1    b  val2  val4
2    a  val2  val3
3    c  val1  val4
4    a  val2  val3
5    b  val2  val4
>>> test = df.groupby('col1').agg(list)
                    col2                col3
col1
a     [val1, val2, val2]  [val3, val3, val3]
b           [val2, val2]        [val4, val4]
c                 [val1]              [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
                        col2         col3
col1
a     {'val1': 1, 'val2': 2}  {'val3': 3}
b                {'val2': 2}  {'val4': 2}
c                {'val1': 1}  {'val4': 1}

Later I can expand the dicts into separate columns so the final output would be:

>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
   val1 val2 val3 val4
a  1.0  2.0  3.0  NaN
b  NaN  2.0  NaN  2.0
c  1.0  NaN  NaN  1.0

I feel like there is a simpler solution, any help is appreciated.

CodePudding user response:

Yes there is, melt crosstab:

df2 = df.melt(id_vars='col1', value_name='count')
pd.crosstab(df2['col1'], df2['count'])

output:

count  val1  val2  val3  val4
col1                         
a         1     2     3     0
b         0     2     0     2
c         1     0     0     1

If you want NaN:

df3 = pd.crosstab(df2['col1'], df2['count'])
df3.mask(df3.eq(0))

output:

count  val1  val2  val3  val4
col1                         
a       1.0   2.0   3.0   NaN
b       NaN   2.0   NaN   2.0
c       1.0   NaN   NaN   1.0

CodePudding user response:

df = pd.concat([df[['col1','col2']], df[['col1','col3']].rename(columns={"col3": "col2"})])
df = df.pivot_table(index = 'col1', columns = 'col2',aggfunc=len)
print(df)

output:

col2  val1  val2  val3  val4
col1
a      1.0   2.0   3.0   NaN
b      NaN   2.0   NaN   2.0
c      1.0   NaN   NaN   1.0

CodePudding user response:

Another option that combines melt, groupby, and unstack:

(df.melt('col1')
   .groupby(['col1', 'value'])
   .size()
   .unstack()
   .rename_axis(index=None, columns=None)
)
       val1  val2  val3  val4
a       1.0   2.0   3.0   NaN
b       NaN   2.0   NaN   2.0
c       1.0   NaN   NaN   1.0
  • Related