pandas groupby and replace rows values based on a condition

Time:09-27

I have this example df:

import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': [1, 1, 1],
                   'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000', '2022-09-07 08:38:37.000'],
                   'country': ['US', 'US', 'US'],
                   'step1_check': ['step1', np.nan, np.nan],
                   'step2_check': [np.nan, 'step2', np.nan],
                   'step3_check': [np.nan, np.nan, 'step3']})

It is similar to a log: each row records one step with its date and time. I want to group by customer to get one row per customer, with each step(n)_check replaced by the timestamp at which that step was logged.

I was able to achieve that with a classical (inefficient) solution.

In the example df there are 3 step(n)_check columns, so I first create one timestamp column per step:

df['step1_date'] = np.nan
df['step2_date'] = np.nan
df['step3_date'] = np.nan

Then I used np.where to copy Date into each step's date column wherever the corresponding check is not null:

df['step1_date'] = np.where(df['step1_check'].notna(), df['Date'], np.nan )
df['step2_date'] = np.where(df['step2_check'].notna(), df['Date'], np.nan )
df['step3_date'] = np.where(df['step3_check'].notna(), df['Date'], np.nan )

Finally, I grouped by customer id and country to get one row per customer with the date of each step:

df.groupby(['customer id','country']).agg({'step1_date':'first','step2_date':'first','step3_date':'first'}).reset_index()

output:

   customer id country               step1_date               step2_date               step3_date
0            1      US  2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000

What is the best approach to automate this for many more steps? It would be inefficient to write a separate np.where condition for each column.

CodePudding user response:

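Select the step columns with filter, forward fill them along axis=1, and assign them back to the dataframe. After the forward fill, the last step column holds the step name for every row (given that each row logs one step), so it can serve as the pivot column. Then pivot the dataframe, and finally add a suffix to the column names.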

steps=df.filter(like='step').ffill(axis=1)
df[steps.columns] = steps
df.pivot(index='customer id', columns=steps.columns[-1], values='Date').add_suffix('_date')

OUTPUT

step3_check               step1_date               step2_date               step3_date
customer id
1            2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000
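
For reference, here is a self-contained sketch of a possible alternative that reshapes with melt and pivot_table instead of overwriting the step columns. It is not the method from the answer above, just one way it could be done. It assumes, like the example df, that the non-null cell values are the step names ('step1', 'step2', ...) and that the step columns end in '_check'. The second customer and the names step_cols, long_df and result are made up for illustration. Using aggfunc='first' keeps the first logged timestamp if a step appears more than once for a customer, and it also keeps country in the result.

```python
import numpy as np
import pandas as pd

# Same shape as the question's data, plus a hypothetical second customer
# who has not reached step 3 yet.
df = pd.DataFrame({
    'customer id': [1, 1, 1, 2, 2],
    'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000',
             '2022-09-07 08:38:37.000', '2022-09-08 10:00:00.000',
             '2022-09-09 10:00:00.000'],
    'country': ['US', 'US', 'US', 'DE', 'DE'],
    'step1_check': ['step1', np.nan, np.nan, 'step1', np.nan],
    'step2_check': [np.nan, 'step2', np.nan, np.nan, 'step2'],
    'step3_check': [np.nan, np.nan, 'step3', np.nan, np.nan],
})

# Assumption: every step column ends in '_check'.
step_cols = [c for c in df.columns if c.endswith('_check')]

# Long format: one row per (log row, step column); keep only logged steps.
long_df = (df.melt(id_vars=['customer id', 'country', 'Date'],
                   value_vars=step_cols, var_name='step_col', value_name='step')
             .dropna(subset=['step']))

# Wide format: one row per customer, one column per step name,
# holding the first timestamp recorded for that step.
result = (long_df.pivot_table(index=['customer id', 'country'], columns='step',
                              values='Date', aggfunc='first')
                 .add_suffix('_date')
                 .rename_axis(columns=None)
                 .reset_index())
print(result)
```

Steps a customer never reached come out as NaN in the corresponding step(n)_date column.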