I have a dataframe looking like this:
Timestamp | description |
---|---|
0 | Parser starts |
12 | parsing |
24 | parsing |
26 | Parsing finished |
28 | Parser starts |
45 | Parsing finished |
I want to calculate how long each parse took. I therefore want the difference between the timestamps where `df['description'] == 'Parsing finished'` and where `df['description'] == 'Parser starts'`. I know I can use `Series.diff()`, but I can only find how to use it with a fixed period. I want to set the period based on the description value.
Expected output:
Timestamp | description | difference |
---|---|---|
0 | Parser starts | NaN |
12 | parsing | NaN |
24 | parsing | NaN |
26 | Parsing finished | 26 |
28 | Parser starts | NaN |
45 | Parsing finished | 17 |
I thought of looping over each row, but this seems counterintuitive when using pandas.
EDIT: Corrected a wrong value thanks to @mozway's comment. To make the requirement clearer, here is an extended table:
Timestamp | description |
---|---|
0 | Parser starts |
12 | parsing |
24 | parsing |
26 | Parsing finished |
27 | Uploading results |
28 | Parser starts |
45 | Parsing finished |
I do not want the timestamp of `Uploading results` (or of any other row that falls outside a `Parser starts` / `Parsing finished` pair) to be part of the diff. Therefore, grouping on `Parser starts` does not give the result I'm looking for. I only want the diff between `Parser starts` and `Parsing finished`.
CodePudding user response:
You can use a `groupby`:
import numpy as np

# make groups, each one starting at a "Parser starts" row
group = df['description'].eq('Parser starts').cumsum()
# set up the grouper
g = df.groupby(group)
# write ptp (= max - min) of each group's timestamps to the last row of the group
df.loc[g.cumcount(ascending=False).eq(0), 'difference'] = g['Timestamp'].transform(np.ptp)
Output:
   Timestamp       description  difference
0          0     Parser starts         NaN
1         12           parsing         NaN
2         24           parsing         NaN
3         26  Parsing finished        26.0
4         28     Parser starts         NaN
5         45  Parsing finished        17.0
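Note that the groupby above writes the value to the last row of each group. With the extended table from the EDIT, the first group would end at `Uploading results`, so that row would receive 27 instead of `Parsing finished` receiving 26. A possible alternative is sketched below: it forward-fills the timestamp of the most recent `Parser starts` row and keeps the difference only on `Parsing finished` rows. The sample frame is rebuilt from the question's table, the helper names `is_start`, `is_end` and `last_start` are made up for illustration, and it assumes every `Parsing finished` row is preceded by a matching `Parser starts` row.

```python
import pandas as pd

# Sample data rebuilt from the extended table in the question
df = pd.DataFrame({
    "Timestamp": [0, 12, 24, 26, 27, 28, 45],
    "description": [
        "Parser starts", "parsing", "parsing", "Parsing finished",
        "Uploading results", "Parser starts", "Parsing finished",
    ],
})

# Boolean masks for the two row types of interest (illustrative names)
is_start = df["description"].eq("Parser starts")
is_end = df["description"].eq("Parsing finished")

# Keep the timestamp only on start rows, then carry it forward to later rows
last_start = df["Timestamp"].where(is_start).ffill()

# Subtract the most recent start; keep the result only on "Parsing finished" rows
df["difference"] = (df["Timestamp"] - last_start).where(is_end)

print(df)
```

Rows in between, such as `Uploading results`, stay `NaN` because the final `where(is_end)` masks everything except the `Parsing finished` rows.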