Get the whole data frame based on time freq and groupby

Time:12-07

I am trying to group a dataframe by a time frequency. Can I get all the columns in the result instead of just the columns specified in the groupby?

code:

data.columns = ['time', 'age', 'salary', 'amount','university', 'gender', 'place', 'education']

DF:

time       age  salary  amount  university  gender  place  education
12/6/2021  24   33333   232323  SK          M       US     BE
12/6/2021  24   33333   232323  SK          M       US     BE
12/8/2021  30   23656   9496    SE          F       UK     BARC
12/9/2021  34   65652   26266   DE          M       UK     BTECH
12/6/2021  25   89893   2652    NK          F       GER    BSC
12/6/2021  25   89893   2652    NK          F       GER    BSC
12/8/2021  70   445464  78989   SE          F       UK     BARC
12/9/2021  45   65656   225415  NK          F       GER    BTECH
12/6/2021  29   5996    3232    NK          M       CAN    BTECH

full_data = data.groupby([pd.Grouper(key='time', freq='4min'),'age', 'salary', 'amount','university']).size().reset_index(name='counts') 

Expected:

time       age  salary  amount  university  gender  place  education  counts
12/6/2021  24   33333   232323  SK          M       US     BE         2
12/8/2021  30   23656   9496    SE          F       UK     BARC       1
12/9/2021  34   65652   26266   DE          M       UK     BTECH      1
12/6/2021  25   89893   2652    NK          F       GER    BSC        2
12/8/2021  70   445464  78989   SE          F       UK     BARC       1
12/9/2021  45   65656   225415  NK          F       GER    BTECH      1
12/6/2021  29   5996    3232    NK          M       CAN    BTECH      1

The result of the above code has only the 5 grouping columns (plus counts). Is there a way to get all the columns?

CodePudding user response:

The first idea is to create a new counts column with transform and then remove duplicates by the grouping columns, e.g.:

data['counts'] = data.groupby([pd.Grouper(key='time', freq='4min'),'age', 'salary', 'amount','university'])['age'].transform('size')

df = data.drop_duplicates(['age', 'salary', 'amount','university'])
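
A minimal runnable sketch of this idea on the sample rows from the question. It assumes the dates are month/day/year and parses time to datetime first, since pd.Grouper with freq needs a datetime-like key. The names keys and result are only for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
data = pd.DataFrame(
    {
        "time": ["12/6/2021", "12/6/2021", "12/8/2021", "12/9/2021", "12/6/2021",
                 "12/6/2021", "12/8/2021", "12/9/2021", "12/6/2021"],
        "age": [24, 24, 30, 34, 25, 25, 70, 45, 29],
        "salary": [33333, 33333, 23656, 65652, 89893, 89893, 445464, 65656, 5996],
        "amount": [232323, 232323, 9496, 26266, 2652, 2652, 78989, 225415, 3232],
        "university": ["SK", "SK", "SE", "DE", "NK", "NK", "SE", "NK", "NK"],
        "gender": ["M", "M", "F", "M", "F", "F", "F", "F", "M"],
        "place": ["US", "US", "UK", "UK", "GER", "GER", "UK", "GER", "CAN"],
        "education": ["BE", "BE", "BARC", "BTECH", "BSC", "BSC", "BARC", "BTECH", "BTECH"],
    }
)

# Assumption: dates are month/day/year. pd.Grouper(freq=...) needs datetimes.
data["time"] = pd.to_datetime(data["time"], format="%m/%d/%Y")

# Illustrative name for the non-time grouping columns.
keys = ["age", "salary", "amount", "university"]

# transform('size') returns one value per original row, so every column is kept.
data["counts"] = (
    data.groupby([pd.Grouper(key="time", freq="4min"), *keys])["age"]
    .transform("size")
)

# Keep the first row of each key combination.
result = data.drop_duplicates(keys).reset_index(drop=True)
print(result)
```

Note that drop_duplicates here ignores the time bin, so if the same key combination can occur in more than one 4-minute bin, the bin would need to be part of the duplicate check as well.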

Or, if the remaining columns have the same values within each group, use all columns in the groupby:

full_data = data.groupby([pd.Grouper(key='time', freq='4min'),'age', 'salary', 'amount','university', 'gender', 'place', 'education']).size().reset_index(name='counts') 
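
A small self-contained sketch of the all-columns groupby, run on the sample rows from the question. It assumes the dates are month/day/year and that the table is whitespace-separated; the names raw and all_cols are only for illustration.

```python
import io

import pandas as pd

# Sample rows copied from the question (assumed whitespace-separated).
raw = """time age salary amount university gender place education
12/6/2021 24 33333 232323 SK M US BE
12/6/2021 24 33333 232323 SK M US BE
12/8/2021 30 23656 9496 SE F UK BARC
12/9/2021 34 65652 26266 DE M UK BTECH
12/6/2021 25 89893 2652 NK F GER BSC
12/6/2021 25 89893 2652 NK F GER BSC
12/8/2021 70 445464 78989 SE F UK BARC
12/9/2021 45 65656 225415 NK F GER BTECH
12/6/2021 29 5996 3232 NK M CAN BTECH
"""
data = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Assumption: dates are month/day/year. pd.Grouper(freq=...) needs datetimes.
data["time"] = pd.to_datetime(data["time"], format="%m/%d/%Y")

# Group by the time bin plus every other column, then count rows per group.
all_cols = ["age", "salary", "amount", "university", "gender", "place", "education"]
full_data = (
    data.groupby([pd.Grouper(key="time", freq="4min"), *all_cols])
    .size()
    .reset_index(name="counts")
)
print(full_data)
```

With the default sort=True the rows come back ordered by the group keys (time bin first), so the order can differ from the expected output in the question even though the counts match.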