Home > Software design >  Extrapolating dataframes to calculate 15min and 30min averages
Extrapolating dataframes to calculate 15min and 30min averages

Time:09-19

Suppose I have a dataframe(the time column has 3 min-windows and grouped by ID-A and ID-B) like this-

ID-A  ID-B    time     sum   num
A       1   09:30:00    5     2
        1   09:33:00    8     2
        1   09:36:00    5     2
        2   09:36:00    10    3
        2   09:39:00    15    3
        2   09:42:00    2     3
B       1   09:30:00    10    2
        1   09:33:00    12    2
        1   09:36:00    5     2

I am trying to calculate 15min and 30min average of the sum divided by num. Reproducible version of my df-

import pandas as pd

data = {'time': ['09:30:00',
                 '09:33:00',
                 '09:36:00',
                 '09:36:00',
                 '09:39:00',
                 '09:42:00',
                 '09:30:00',
                 '09:33:00',
                 '09:36:00'],
         'sum': [5, 8, 5, 10, 15, 2, 10, 12, 5],
         'num': ['2', '2', '2', '3', '3', '3', '2', '2', '2']}
my_index = pd.MultiIndex.from_arrays([["A"]*6   ["B"]*3, [1, 1, 1, 2, 2, 2, 1, 1, 1]], names=["ID-A", "ID-B"])
df = pd.DataFrame(data, index=my_index)

Note:- For 1 pair of ID-A and ID-B, the num is always the same.

Desired Dataframe(grouped by ID-A and ID-B)-

ID-A  ID-B    time     sum   num   15min   30min  
A       1   09:30:00    5     2     15      30  
            09:33:00    8     2     15      30  
            09:36:00    5     2     15      30  
        2   09:36:00    10    3     15      30  
            09:39:00    15    3     15      30  
            09:42:00    2     3     15      30  
B       1   09:30:00    10    2     22.5    45  
            09:33:00    12    2     22.5    45  
            09:36:00    5     2     22.5    45  

For Example - For ID-A -> A and ID-B -> 1, the total time data was available for only 9 minutes. So I did, (5 8 5)/9 = 18/9 = 2 for 1 minute. It also has to be divided by num, so 2/2=1. Therefore, for 15 minutes, it will be 15 and 30 for 30 minutes. There could be an instance where the time data is available for 15 or 30 minutes. Then obviously, extrapolation is not required only normal calculations should happen.

My approach- Since the maximum average I need is 30 minutes, I thought of extrapolating all the values first to 30 minutes so I don't have to care about whether I have all values present. Eventually I just want ID-A, ID-B, 15min and 30min columns only in my df but this will also work.

CodePudding user response:

Looks like this would work?

# cast 'num' to float
df['num'] = df['num'].astype(float)

def add_cols(grp):
    # divide sum by 3xnum of rows, and then divide by 'num'
    multiple = grp['sum'].sum() / (3*len(grp)) / grp.iloc[0, -1]
    return grp.assign(**{'15min': 15 * multiple, '30min': 30 * multiple})

df.groupby(['ID-A', 'ID-B']).apply(add_cols)

Output:

                       time   sum   num     15min   30min
ID-A ID-B                   
A   1   2022-09-18 09:30:00     5   2.0     15.0    30.0
1       2022-09-18 09:33:00     8   2.0     15.0    30.0
1       2022-09-18 09:36:00     5   2.0     15.0    30.0
2       2022-09-18 09:36:00     10  3.0     15.0    30.0
2       2022-09-18 09:39:00     15  3.0     15.0    30.0
2       2022-09-18 09:42:00     2   3.0     15.0    30.0
B   1   2022-09-18 09:30:00     10  2.0     22.5    45.0
1       2022-09-18 09:33:00     12  2.0     22.5    45.0
1       2022-09-18 09:36:00     5   2.0     22.5    45.0
  • Related