I have a dataframe like as shown below
stud_id prod_id total_qty ques_date inv_qty inv_date bkl_qty bkl_date csum accu_qty accu_date upto_inv_threshold upto_bkl_threshold upto_accu_threshold
0 101 12 100 13/11/2010 7.00000 16/02/2012 15 2013-01-16 15 10 13/08/2021 7.00000 22.00000 32.00000
1 101 12 100 13/11/2010 7.00000 16/02/2012 40 2011-10-22 55 10 13/08/2021 7.00000 62.00000 72.00000
2 101 12 100 13/11/2010 7.00000 16/02/2012 2 2019-09-10 57 10 13/08/2021 7.00000 64.00000 74.00000
df = pd.read_clipboard()
I would like to do the two steps listed below
step-1) search for values >=50 in a dataframe and return only the 1st occurrence.
Do the above search only in 3 columns - upto_inv_threshold
, upto_bkl_threshold
, upto_accu_threshold
but column wise. Meaning, complete the search in one column first and then move to the next column. Ex: we search all values of upto_inv_threshold
first, then we search all values of upto_bkl_threshold
, later/finally we search all values of upto_accu_threshold
step-2) Get the corresponding date for that 1st occurrence value found in step 1. If the value is found upto_inv_threshold
, get the inv_date
. If the 1st occurrence value is found in upto_bkl_threshold
, get the bkl_date
. If the 1st occurrence value is found in upto_accu_threshold
, get the accu_date
.
I tried the below
df_stage_3.loc[:, 'upto_inv_threshold':'upto_accu_threshold']
np.where(df_stage_3.loc[:, 'upto_inv_threshold':'upto_accu_threshold']>=50)
but this is going nowhere and am not able to proceed further.
We have to do this for each stud_id
and prod_id
. Currently, in sample data we have only group but in real data we will have multiple groups of stud_id
and prod_id
.
I expect my output to be like as shown below. we get the date from bkl_date
column because the 1st value (that met our criteria >=50) was 62 (present in upto_bkl_threshold)
stud_id, prod_id, fifty_pct_date
101, 12, 2011-10-22
CodePudding user response:
Select the required cols
, then create a boolean mask
to identify the cells in threshold like columns which have values > 50, then use this boolean mask to mask the values in corresponding date columns. Now group
the dataframe by stud_id
and prod_id
and aggregate using first
, finally bfill
(backfill) along the column axis to get the first occurrence of date where threshold is reached.
cols = pd.Index(['inv', 'bkl', 'accu'])
mask = df['upto_' cols '_threshold'].gt(50)
(
df[cols '_date']
.where(mask.to_numpy())
.groupby([df['stud_id'], df['prod_id']]).first()
.bfill(axis=1).iloc[:, 0]
.rename('fifty_pct_date')
.reset_index()
)
Result
stud_id prod_id fifty_pct_date
0 101 12 2011-10-22
CodePudding user response:
I think you can also get the target date(s) by the following code:
Code:
import pandas as pd
# Create a sample dataframe
df = pd.DataFrame({'stud_id': {0: 101, 1: 101, 2: 101}, 'prod_id': {0: 12, 1: 12, 2: 12}, 'total_qty': {0: 100, 1: 100, 2: 100}, 'ques_date': {0: '13/11/2010', 1: '13/11/2010', 2: '13/11/2010'}, 'inv_qty': {0: 7.0, 1: 7.0, 2: 7.0}, 'inv_date': {0: '16/02/2012', 1: '16/02/2012', 2: '16/02/2012'}, 'bkl_qty': {0: 15, 1: 40, 2: 2}, 'bkl_date': {0: '2013-01-16', 1: '2011-10-22', 2: '2019-09-10'}, 'csum': {0: 15, 1: 55, 2: 57}, 'accu_qty': {0: 10, 1: 10, 2: 10}, 'accu_date': {0: '13/08/2021', 1: '13/08/2021', 2: '13/08/2021'}, 'upto_inv_threshold': {0: 7.0, 1: 7.0, 2: 7.0}, 'upto_bkl_threshold': {0: 22.0, 1: 62.0, 2: 64.0}, 'upto_accu_threshold': {0: 32.0, 1: 72.0, 2: 74.0}})
# Transform df
symbols = ['inv', 'bkl', 'accu']
df1 = df.melt(['stud_id', 'prod_id'], [f'{s}_date' for s in symbols], value_name='date')
df2 = df.melt(['stud_id', 'prod_id'], [f'upto_{s}_threshold' for s in symbols], value_name='threshold')
# Merge and get the target date(s)
df = df1.join(df2.loc[df2.threshold>=50, 'threshold'], how='inner')
df = df.groupby(['stud_id', 'prod_id'], as_index=False)['date'].first()
print(df)
Output:
stud_id | prod_id | date |
---|---|---|
101 | 12 | 2011-10-22 |