I have this example df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'customer id': [1, 1, 1],
                   'Date': ['2022-09-05 08:38:37.000', '2022-09-06 08:38:37.000', '2022-09-07 08:38:37.000'],
                   'country': ['US', 'US', 'US'],
                   'step1_check': ['step1', np.nan, np.nan],
                   'step2_check': [np.nan, 'step2', np.nan],
                   'step3_check': [np.nan, np.nan, 'step3']})
It is similar to a log with a date and time for each step. I want to group by customer to get one row per customer and replace each step(n)_check with its timestamp.
I was able to achieve that with a classical (inefficient) solution. In the example df there are 3 step_check columns whose timestamps I want to track, so I first created a date column for each step:
df['step1_date'] = np.nan
df['step2_date'] = np.nan
df['step3_date'] = np.nan
Then I used an np.where condition to fill each step date wherever the corresponding check is not null:
df['step1_date'] = np.where(df['step1_check'].notna(), df['Date'], np.nan )
df['step2_date'] = np.where(df['step2_check'].notna(), df['Date'], np.nan )
df['step3_date'] = np.where(df['step3_check'].notna(), df['Date'], np.nan )
Finally, I grouped by customer id to get one row per customer with the dates of all its steps:
df.groupby(['customer id', 'country']).agg(
    {'step1_date': 'first', 'step2_date': 'first', 'step3_date': 'first'}
).reset_index()
output:

   customer id country               step1_date               step2_date               step3_date
0            1      US  2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000
What is the best approach to automate this for many more steps? It would be inefficient to write a separate np.where condition for each column.
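For context, the straightforward generalization I can think of is a loop over the step columns (a rough sketch, assuming every such column follows the step{n}_check naming pattern), but I am hoping for something more idiomatic:

# Rough sketch: loop over every step*_check column and apply the same np.where logic.
step_cols = [c for c in df.columns if c.startswith('step') and c.endswith('_check')]
date_cols = [c.replace('_check', '_date') for c in step_cols]
for check_col, date_col in zip(step_cols, date_cols):
    df[date_col] = np.where(df[check_col].notna(), df['Date'], np.nan)
df.groupby(['customer id', 'country'])[date_cols].first().reset_index()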
CodePudding user response:
Filter the step columns, forward fill them along axis=1, and assign them back to the dataframe. Then pivot the dataframe and finally add a suffix to the column names.
# Forward-fill across the step columns so the last one holds each row's step label
steps = df.filter(like='step').ffill(axis=1)
df[steps.columns] = steps
# Pivot the step labels into columns, using the Date as values
df.pivot(index='customer id', columns=steps.columns[-1], values='Date').add_suffix('_date')
OUTPUT

step3_check               step1_date               step2_date               step3_date
customer id
1            2022-09-05 08:38:37.000  2022-09-06 08:38:37.000  2022-09-07 08:38:37.000
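The forward fill along axis=1 is what makes pivoting on steps.columns[-1] (here step3_check) work: after the fill, the last step column carries each row's step label. For the example df, the intermediate steps frame looks like this:

  step1_check step2_check step3_check
0       step1       step1       step1
1         NaN       step2       step2
2         NaN         NaN       step3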