pandas group data at 3 month intervals and aggregate list of functions


I have a dataframe as shown below:

import pandas as pd

df = pd.DataFrame({'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
                   'invoice_id':[1,2,3,4,5,6,7,8,9,10,11,12],
                   'purchase_date' :['2017-04-03 12:35:00','2017-04-03 12:50:00','2018-04-05 12:59:00','2018-05-04 13:14:00','2017-05-05 13:37:00','2018-07-06 13:39:00','2018-07-08 11:30:00','2017-04-08 16:00:00','2019-04-09 22:00:00','2019-04-11 04:00:00','2018-04-13 04:30:00','2017-04-14 08:00:00'],
                   'val' :[5,5,5,5,1,6,5,5,8,3,4,6],
                   'Prod_id':['A1','A1','C1','A1','E1','Q1','G1','F1','G1','H1','J1','A1']})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

I would like to do the following:

a) group the data by subject_id at 3-month intervals (using the purchase_date column)

b) compute statistics such as mean, sum, nunique and count for each group, based on the other variables such as Prod_id, val, etc.

For example: the earliest purchase date in df is 2017-04-03, so the starting month in the dataset is April. We count 3 months from April: APR, MAY and JUN form M1; JUL, AUG and SEP form M2; and so on. This yields data at 3-month intervals. Whenever an intervening 3-month interval has no data, we record it as zero (0).
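
To make the intended numbering concrete, here is a minimal sketch of the bucket arithmetic (the `bucket` helper is purely illustrative, not part of my real code):

from datetime import date

def bucket(d, start=date(2017, 4, 1)):
    # whole months elapsed since the start month, taken in blocks of 3
    months = (d.year - start.year) * 12 + (d.month - start.month)
    return months // 3 + 1

print(bucket(date(2017, 5, 5)))   # 1 -> May 2017 falls in M1
print(bucket(date(2018, 7, 8)))   # 6 -> Jul 2018 falls in M6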

So, based on online research, I tried something like the below:

    length_of_purchase_date = 10

    date_month_map = {
        str(x)[:length_of_purchase_date]: 'M_%s' % (i + 1) for i, x in enumerate(
            sorted(df.reset_index()['purchase_date'].unique(), reverse=True)
        )
    }

    df.reset_index().groupby([
        'subject_id',
        pd.Grouper(key='purchase_date', freq='3M')
    ]).agg({
        'val': ['sum', 'mean', 'count'],
    })
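
As far as I can tell, even after fixing the typos this does not produce the grouping I described: freq='3M' seems to anchor its bins to calendar month-ends starting from the first observation, not to fixed 3-month blocks counted from the first purchase month. The bins it actually produces can be inspected like this:

    # inspect the bin labels pd.Grouper picks (the labels are the bin end dates)
    print(df.groupby(pd.Grouper(key='purchase_date', freq='3M'))['val'].sum())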

I expect my output to be as shown below (here for subject_id = 1). Please note that I have to do this on big data with millions of rows.

[expected output table was shown as an image in the original post]

CodePudding user response:

Use:

import numpy as np

# sort by subject_id and purchase_date
df = df.sort_values(['subject_id','purchase_date'])

# create 3-month group numbers: convert to monthly periods,
# subtract the minimum and integer-divide by 3
per = df['purchase_date'].dt.to_period('m').astype('int64')
df['date_group'] = per.sub(per.min()) // 3 + 1
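
To verify the group numbers line up with the wanted M1/M2 scheme, spot-check a few rows (the expected values in the comment follow from the rule above):

# spot-check: Apr/May 2017 -> 1, Apr/May 2018 -> 5, Jul 2018 -> 6
print(df[['subject_id', 'purchase_date', 'date_group']])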

# custom function: most frequent Prod_id per group
def f(x):
    # get counts of each value
    y = x.value_counts()
    # keep the maximal counts and filter out counts equal to 1
    y = y[y.eq(y.max()) & y.ne(1)]
    # return the single most frequent value; NaN if tied or all values unique
    return y.index[0] if len(y) == 1 else np.nan
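
The helper works like a mode that returns NaN on ties or when every value occurs once, e.g.:

# 'A1' occurs twice, so it is returned; all-unique input gives NaN
print(f(pd.Series(['A1', 'A1', 'E1'])))   # A1
print(f(pd.Series(['C1', 'A1'])))         # nan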

# aggregate per subject_id and date_group with named aggregation
df = df.groupby(['subject_id', 'date_group']).agg(max_date=('purchase_date','max'),
                                                  nunique=('Prod_id','nunique'),
                                                  count_prod_id=('Prod_id','count'),
                                                  sum_val=('val','sum'),
                                                  avg_val=('val','mean'),
                                                  min_val=('val','min'),
                                                  max_val=('val','max'),
                                                  Top1st_prod_id=('Prod_id',f))

# dictionary for filling numeric columns with 0 (skip max_date and Top1st_prod_id)
d = dict.fromkeys(df.columns.difference(['max_date','Top1st_prod_id']), 0)
# add missing date_group values per subject with reindex
df = (df.reset_index(level=0)
         .groupby('subject_id')
         .apply(lambda x: x.reindex(range(1, x.index.max() + 1)))
         .fillna(d))
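
The reindex step is what materialises the empty 3-month buckets. On a toy frame the effect looks like this:

# toy example: group 2 is missing, reindex inserts it as a NaN row,
# which fillna then converts to zeros for the numeric columns
s = pd.DataFrame({'v': [10, 30]}, index=pd.Index([1, 3], name='date_group'))
print(s.reindex(range(1, s.index.max() + 1)))
#                v
# date_group
# 1           10.0
# 2            NaN
# 3           30.0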

# custom formatting of columns
df['max_date'] = df['max_date'].dt.strftime('%d-%b-%y')
df = df.drop('subject_id', axis=1).reset_index()
df['date_group'] = 'M' + df['date_group'].astype(str)

print (df)
    subject_id date_group   max_date  nunique  count_prod_id  sum_val  \
0            1         M1  05-May-17      2.0            3.0     11.0   
1            1         M2        NaN      0.0            0.0      0.0   
2            1         M3        NaN      0.0            0.0      0.0   
3            1         M4        NaN      0.0            0.0      0.0   
4            1         M5  04-May-18      2.0            2.0     10.0   
5            1         M6  08-Jul-18      2.0            2.0     11.0   
6            2         M1  14-Apr-17      2.0            2.0     11.0   
7            2         M2        NaN      0.0            0.0      0.0   
8            2         M3        NaN      0.0            0.0      0.0   
9            2         M4        NaN      0.0            0.0      0.0   
10           2         M5  13-Apr-18      1.0            1.0      4.0   
11           2         M6        NaN      0.0            0.0      0.0   
12           2         M7        NaN      0.0            0.0      0.0   
13           2         M8        NaN      0.0            0.0      0.0   
14           2         M9  11-Apr-19      2.0            2.0     11.0   

     avg_val  min_val  max_val Top1st_prod_id  
0   3.666667      1.0      5.0             A1  
1   0.000000      0.0      0.0            NaN  
2   0.000000      0.0      0.0            NaN  
3   0.000000      0.0      0.0            NaN  
4   5.000000      5.0      5.0            NaN  
5   5.500000      5.0      6.0            NaN  
6   5.500000      5.0      6.0            NaN  
7   0.000000      0.0      0.0            NaN  
8   0.000000      0.0      0.0            NaN  
9   0.000000      0.0      0.0            NaN  
10  4.000000      4.0      4.0            NaN  
11  0.000000      0.0      0.0            NaN  
12  0.000000      0.0      0.0            NaN  
13  0.000000      0.0      0.0            NaN  
14  5.500000      3.0      8.0            NaN  