Search the first occurrence of a value col by col using a criteria?


I have a dataframe as shown below

stud_id prod_id total_qty   ques_date   inv_qty inv_date    bkl_qty bkl_date    csum    accu_qty    accu_date   upto_inv_threshold  upto_bkl_threshold  upto_accu_threshold
0   101 12  100 13/11/2010  7.00000 16/02/2012  15  2013-01-16  15  10  13/08/2021  7.00000 22.00000    32.00000
1   101 12  100 13/11/2010  7.00000 16/02/2012  40  2011-10-22  55  10  13/08/2021  7.00000 62.00000    72.00000
2   101 12  100 13/11/2010  7.00000 16/02/2012  2   2019-09-10  57  10  13/08/2021  7.00000 64.00000    74.00000

df = pd.read_clipboard()

I would like to do the two steps listed below

step-1) Search for values >= 50 in the dataframe and return only the 1st occurrence.

Do the above search only in 3 columns - upto_inv_threshold, upto_bkl_threshold and upto_accu_threshold - but column-wise. Meaning, complete the search in one column first and then move to the next column. Ex: we search all values of upto_inv_threshold first, then all values of upto_bkl_threshold, and finally all values of upto_accu_threshold.

step-2) Get the corresponding date for the 1st occurrence value found in step 1. If the value is found in upto_inv_threshold, get the inv_date. If it is found in upto_bkl_threshold, get the bkl_date. If it is found in upto_accu_threshold, get the accu_date.

I tried the below

df_stage_3.loc[:, 'upto_inv_threshold':'upto_accu_threshold']
np.where(df_stage_3.loc[:, 'upto_inv_threshold':'upto_accu_threshold']>=50)

but this is going nowhere and I am not able to proceed further.
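As far as I can tell, the np.where call only gives back positional (row, column) indices, scanned row by row rather than column by column, so on the sample above it returns something like the below and I cannot map it back to the right date column:

# positional indices of cells >= 50 in the three threshold columns (row-major order)
np.where(df_stage_3.loc[:, 'upto_inv_threshold':'upto_accu_threshold'] >= 50)
# (array([1, 1, 2, 2]), array([1, 2, 1, 2]))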

We have to do this for each stud_id and prod_id group. Currently, the sample data has only one group, but the real data will have multiple groups of stud_id and prod_id.

I expect my output to be as shown below. We get the date from the bkl_date column because the 1st value that met our criteria (>= 50) was 62, present in upto_bkl_threshold.

stud_id, prod_id, fifty_pct_date
101,      12,       2011-10-22

CodePudding user response:

Select the required columns, then create a boolean mask identifying the cells in the threshold columns whose values are >= 50, and use this mask to mask the values in the corresponding date columns. Then group the dataframe by stud_id and prod_id and aggregate with first; finally, bfill (backfill) along the column axis to get the first date at which the threshold is reached.

cols = pd.Index(['inv', 'bkl', 'accu'])
# boolean mask over the three threshold columns: True where the value is >= 50
mask = df['upto_' + cols + '_threshold'].ge(50)

(
    df[cols + '_date']
    .where(mask.to_numpy())                           # keep a date only where its threshold is >= 50
    .groupby([df['stud_id'], df['prod_id']]).first()  # first non-null date per column within each group
    .bfill(axis=1).iloc[:, 0]                         # take inv_date, else bkl_date, else accu_date
    .rename('fifty_pct_date')
    .reset_index()
)

Result

   stud_id  prod_id fifty_pct_date
0      101       12     2011-10-22
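To see why bkl_date is the one that gets picked here, it may help to look at the intermediate masked date frame (a quick check using df, cols and mask from above; output shown approximately):

# dates survive only where the matching threshold column is >= 50
masked_dates = df[cols + '_date'].where(mask.to_numpy())
print(masked_dates)
#   inv_date    bkl_date   accu_date
# 0      NaN         NaN         NaN
# 1      NaN  2011-10-22  13/08/2021
# 2      NaN  2019-09-10  13/08/2021
# groupby(...).first() keeps the first non-null date per column, and
# bfill(axis=1).iloc[:, 0] then takes the leftmost column that has one
# (inv before bkl before accu), which is bkl_date -> 2011-10-22.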

CodePudding user response:

I think you can also get the target date(s) by the following code:

Code:

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({'stud_id': {0: 101, 1: 101, 2: 101}, 'prod_id': {0: 12, 1: 12, 2: 12}, 'total_qty': {0: 100, 1: 100, 2: 100}, 'ques_date': {0: '13/11/2010', 1: '13/11/2010', 2: '13/11/2010'}, 'inv_qty': {0: 7.0, 1: 7.0, 2: 7.0}, 'inv_date': {0: '16/02/2012', 1: '16/02/2012', 2: '16/02/2012'}, 'bkl_qty': {0: 15, 1: 40, 2: 2}, 'bkl_date': {0: '2013-01-16', 1: '2011-10-22', 2: '2019-09-10'}, 'csum': {0: 15, 1: 55, 2: 57}, 'accu_qty': {0: 10, 1: 10, 2: 10}, 'accu_date': {0: '13/08/2021', 1: '13/08/2021', 2: '13/08/2021'}, 'upto_inv_threshold': {0: 7.0, 1: 7.0, 2: 7.0}, 'upto_bkl_threshold': {0: 22.0, 1: 62.0, 2: 64.0}, 'upto_accu_threshold': {0: 32.0, 1: 72.0, 2: 74.0}})

# Transform df
symbols = ['inv', 'bkl', 'accu']
df1 = df.melt(['stud_id', 'prod_id'], [f'{s}_date' for s in symbols], value_name='date')
df2 = df.melt(['stud_id', 'prod_id'], [f'upto_{s}_threshold' for s in symbols], value_name='threshold')

# Merge and get the target date(s)
df = df1.join(df2.loc[df2.threshold>=50, 'threshold'], how='inner')
df = df.groupby(['stud_id', 'prod_id'], as_index=False)['date'].first()

print(df)

Output:

   stud_id  prod_id        date
0      101       12  2011-10-22
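One note on why the positional join works, as I read it: melt stacks the inv block first, then bkl, then accu, so the row order of df1/df2 already encodes the required column-wise priority, and joining on that shared index lines each date up with its threshold. Printing df2 before the filter makes the ordering visible (output shown approximately):

print(df2)
#    stud_id  prod_id             variable  threshold
# 0      101       12   upto_inv_threshold        7.0
# 1      101       12   upto_inv_threshold        7.0
# 2      101       12   upto_inv_threshold        7.0
# 3      101       12   upto_bkl_threshold       22.0
# 4      101       12   upto_bkl_threshold       62.0
# 5      101       12   upto_bkl_threshold       64.0
# 6      101       12  upto_accu_threshold       32.0
# 7      101       12  upto_accu_threshold       72.0
# 8      101       12  upto_accu_threshold       74.0
# Rows 4, 5, 7 and 8 pass the >= 50 filter; the inner join keeps the matching
# dates from df1, and groupby(...)['date'].first() returns row 4's date, 2011-10-22.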