How to use Pandas diff() with another column's value as period?

I have a dataframe looking like this:

Timestamp	description
0	Parser starts
12	parsing
24	parsing
26	Parsing finished
28	Parser starts
45	Parsing finished

I want to calculate how long each parse took. I therefore want the difference between the timestamps where (df['description'] == 'Parsing finished') and (df['description'] == 'Parser starts'). I know I can use df.diff(), but I can only find how to use it with a fixed period. I want to set the period based on the description value.

Expected output:

Timestamp	description	difference
0	Parser starts	NaN
12	parsing	NaN
24	parsing	NaN
26	Parsing finished	26
28	Parser starts	NaN
45	Parsing finished	17

I thought of looping over each row but this seems counterintuitive when using Pandas.

EDIT: fixed a wrong value thanks to @mozway's comment. To make myself clearer, here is an extended table:

Timestamp	description
0	Parser starts
12	parsing
24	parsing
26	Parsing finished
27	Uploading results
28	Parser starts
45	Parsing finished

I do not want the timestamp of "Uploading results" (or of other rows in between) to be part of the diff. Therefore, grouping on "Parser starts" does not give the result I'm looking for. I only want the diff between "Parser starts" and "Parsing finished".

CodePudding user response:

You can use a groupby:

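import numpy as np
import pandas as pd

# sample data from the question (first table)
df = pd.DataFrame({
    'Timestamp': [0, 12, 24, 26, 28, 45],
    'description': ['Parser starts', 'parsing', 'parsing',
                    'Parsing finished', 'Parser starts', 'Parsing finished'],
})

# make groups starting with "Parser starts"
group = df['description'].eq('Parser starts').cumsum()

# set up the grouper
g = df.groupby(group)

# on the last row of each group, store ptp (= max - min) of the group's timestamps
df.loc[g.cumcount(ascending=False).eq(0),
       'difference'] = g['Timestamp'].transform(np.ptp)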

output:

   Timestamp       description  difference
0          0     Parser starts         NaN
1         12           parsing         NaN
2         24           parsing         NaN
3         26  Parsing finished        26.0
4         28     Parser starts         NaN
5         45  Parsing finished        17.0
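
Note that with the edited table from the question, the grouping above would treat "Uploading results" as the last row of the first group, so the difference (27) would land on that row instead of on "Parsing finished". One possible way around this, as a minimal sketch only: carry the timestamp of the most recent "Parser starts" row forward and subtract it on the "Parsing finished" rows. The helper names is_start, is_finish and last_start are made up for illustration, and the sketch assumes every "Parsing finished" row is preceded by a matching "Parser starts" row.

```python
import pandas as pd

# edited sample data from the question (includes "Uploading results")
df = pd.DataFrame({
    "Timestamp": [0, 12, 24, 26, 27, 28, 45],
    "description": ["Parser starts", "parsing", "parsing", "Parsing finished",
                    "Uploading results", "Parser starts", "Parsing finished"],
})

# boolean masks for the two kinds of rows we care about (hypothetical names)
is_start = df["description"].eq("Parser starts")
is_finish = df["description"].eq("Parsing finished")

# keep the timestamp only on start rows, then carry it forward to later rows
last_start = df["Timestamp"].where(is_start).ffill()

# elapsed time since the most recent start, kept only on finish rows
df["difference"] = (df["Timestamp"] - last_start).where(is_finish)

print(df)
```

Rows that are neither a start nor a finish are left as NaN, so anything logged between two parses does not affect the result.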