spread then subtract consecutive rows in pandas data frame-CodePudding

I have a data frame where I need to pair consecutive events within the same day and subtract the later from the former. Each has a timestamp and a date.

         time                     date   event   score
0 2022-03-07 06:45:00 00:00  2022-03-07  light   80.066667
1 2022-03-07 18:12:00 00:00  2022-03-07  dark   79.857667
2 2022-03-30 06:25:00 00:00  2022-03-30  light  107.060833
3 2022-03-30 13:38:00 00:00  2022-03-30  dark  105.324000
4 2022-03-30 13:40:00 00:00  2022-03-30  dark  105.239750
5 2022-03-30 15:47:00 00:00  2022-03-30  light  106.863143
6 2022-04-01 06:25:00 00:00  2022-04-01  light  101.271867

I have tried spreading the data frame using

df = df.pivot(index='time', columns='event', values='score')

event                             light       dark
time                                             
2022-03-07 06:45:00 00:00   80.066667         NaN
2022-03-07 18:12:00 00:00         NaN   79.857667
2022-03-30 06:25:00 00:00  107.060833         NaN
2022-03-30 13:38:00 00:00         NaN  105.324000
2022-03-30 13:40:00 00:00         NaN  105.239750
2022-03-30 15:47:00 00:00  106.863143         NaN
2022-04-01 06:25:00 00:00  101.271867         NaN

however because the events happen at different times the spread data frame has NaNs. I would ideally end up with this, where I keep the time of the first occurrence in the pair (light or dark), align the events (note: corresponding dark to match light as not yet occurred for 2022-04-01) and when light comes first I subtract the later value from the former and when dark comes first I subtract the former value from the later.

event                             light  dark          diff
time                                             
2022-03-07 06:45:00 00:00   80.066667    79.857667    -0.208999
2022-03-30 06:25:00 00:00  107.060833    105.324000   -1.7368
2022-03-30 13:40:00 00:00  106.863143    105.239750   -1.6233
2022-04-01 06:25:00 00:00  101.271867    NaN          NaN

CodePudding user response：

Here's one way using groupby cumcount to create groups, then use that groups in groupby first to get the first time each event happens each day. Then pivot.

Finally, use diff to get the difference between "light" and "dark" and assign the differences to column "diff" in df:

out = (df.assign(time=df.groupby(df.groupby('event').cumcount())['time'].transform('first'))
       .pivot('time', 'event', 'score').reset_index().rename_axis([None], axis=1)
       .assign(diff=lambda x: x['dark']-x['light']))

Output:

                        time        dark       light      diff
0  2022-03-07 06:45:00 00:00   79.857667   80.066667 -0.209000
1  2022-03-30 06:25:00 00:00  105.324000  107.060833 -1.736833
2  2022-03-30 13:40:00 00:00  105.239750  106.863143 -1.623393
3  2022-04-01 06:25:00 00:00         NaN  101.271867       NaN