I have 4 dataframes like as given below
df_raw = pd.DataFrame(
{'stud_id' : [101, 101,101],
'prod_id':[12,13,16],
'total_qty':[100,1000,80],
'ques_date' : ['13/11/2020', '10/1/2018','11/11/2017']})
df_accu = pd.DataFrame(
{'stud_id' : [101,101,101],
'prod_id':[12,13,16],
'accu_qty':[10,500,10],
'accu_date' : ['13/08/2021','02/11/2019','17/12/2018']})
df_inv = pd.DataFrame(
{'stud_id' : [101,101,101],
'prod_id':[12,13,18],
'inv_qty':[5,100,15],
'inv_date' : ['16/02/2022', '22/11/2020','19/10/2019']})
df_bkl = pd.DataFrame(
{'stud_id' : [101,101,101,101],
'prod_id' :[12,12,12,17],
'bkl_qty' :[15,40,2,10],
'bkl_date':['16/01/2022', '22/10/2021','09/10/2020','25/06/2020']})
My objective is to find out the below
a) Get the date when threshold exceeds 50%
threshold is given by the formula below
threshold = (((df_inv['inv_qty'] df_bkl['bkl_qty'] df_accu['accu_qty'])/df_raw['total_qty'])*100)
We have to add in the same order. Meaning, first, we have to add inv_qty
, then bkl_qty
and finally accu_qty
.We do this way in order to identify the correct date when they exceeded 50% of total qty. Additionally, this has to be computed for each stud_id
and prod_id
.
but the problem is df_bkl
has multiple records for the same stud_id
and prod_id
and it is by design. Real data also looks like this. Whereas df_accu
and df_inv
will have only row for each stud_id
and prod_id
.
In the above formula for df['bkl_qty'],we have to use each value of df['bkl_qty']
to compute the sum.
for ex: let's take stud_id = 101
and prod_id = 12
.
His total_qty = 100
, inv_qty = 5
, his accu_qty=10
. but he has three bkl_qty
values - 15,40 and 2. So, threshold has to be computed in a fashion like below
5 (is value of inv_qty) 15 (is 1st value of bkl_qty) 40 (is 2nd value of bkl_qty) 2 (is 3rd value of bkl_qty) 10(is value of accu_qty)
So, now with the above, we can know that his threshold exceeded 50% when his bkl_qty
value was 40. Meaning, 5 15 40 = 60 (which is greater than 50% of total_qty (100)).
I was trying something like below
df_stage_1 = df_raw.merge(df_inv,on=['stud_id','prod_id'], how='left').fillna(0)
df_stage_2 = df_stage_1.merge(df_bkl,on=['stud_id','prod_id'])
df_stage_3 = df_stage_2.merge(df_accu,on=['stud_id','prod_id'])
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] df_stage_3['bkl_qty'] df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
But this is incorrect as I am not able to do each value by value for bkl_qty
from df_bkl
In this post, I have shown only sample data with one stud_id=101
but in real time I have more than 1000's of stud_id
and prod_id
.
Therfore, any elegant and efficient approach would be useful. We have to apply this logic on million record datasets.
I expect my output to be like as shown below. whenever the sum value exceeds 50% of total_qty, we need to get that corresponding date
stud_id,prod_id,total_qty,threshold,threshold_date
101 12 100 72 22/10/2021
CodePudding user response:
It can be achieved using groupby
and cumsum
which does cumulative summation.
# add cumulative sum column to df_bkl
df_bkl['csum'] = df_bkl.groupby(['stud_id','prod_id'])['bkl_qty'].cumsum()
# use df_bkl['csum'] to compute threshold instead of bkl_qty
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] df_stage_3['csum'] df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
# check if inv_qty already exceeds threshold
df_stage_3.loc[df_stage_3.inv_qty > df_stage_3.total_qty/2, 'bkl_date'] = df_stage_3['inv_date']
# next doing some filter and merge to arrive at the desired df
gt_thres = df_stage_3[df_stage_3['threshold'] > df_stage_3['total_qty']/2]
df_f1 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].min().to_frame(name='threshold').reset_index()
df_f2 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].max().to_frame(name='threshold_max').reset_index()
df = pd.merge(df_f1, df_stage_3, on=['stud_id','prod_id','total_qty','threshold'], how='inner')
df2 = pd.merge(df,df_f2, on=['stud_id','prod_id','total_qty'], how='inner')
df2 = df2[['stud_id','prod_id','total_qty','threshold','bkl_date']].rename(columns={'threshold_max':'threshold', 'bkl_date':'threshold_date'})
print(df2)
provides the output as:
stud_id prod_id total_qty threshold threshold_date
0 101 12 100 72.0 22/10/2021
Does this work?