I have the following pandas dataframe df
:
timestamp | version | actual | pred |
---|---|---|---|
2022-01-19 11:00:00.600 | 1 | 0 | 0 |
2022-01-19 11:00:00.800 | 1 | 0 | 1 |
2022-01-19 11:00:01.200 | 1 | 1 | 0 |
2022-01-19 11:00:01.800 | 1 | 0 | 0 |
2022-01-19 11:00:02.200 | 2 | 1 | 1 |
2022-01-19 11:00:02.600 | 2 | 0 | 0 |
2022-01-19 11:00:03.200 | 3 | 0 | 1 |
2022-01-19 11:00:03.600 | 3 | 0 | 0 |
2022-01-19 11:00:03.800 | 3 | 1 | 0 |
2022-01-19 11:00:03.600 | 4 | 0 | 0 |
2022-01-19 11:00:03.800 | 4 | 0 | 1 |
I need to calculate the total distance in milliseconds between values of 1
in actual
and pred
, grouped by version
.
The expected answer is: 1000 ms
- version 1: (2022-01-19 11:00:01.200 - 2022-01-19 11:00:00.600) = 400 ms
- version 2: (2022-01-19 11:00:02.200 - 2022-01-19 11:00:02.200) = 0 ms
- version 3: (2022-01-19 11:00:03.800 - 2022-01-19 11:00:03.200) = 600 ms
CodePudding user response:
Assuming there exists a single "actual" and single "pred" values for each "version", we could simply subtract values. In other words, if each "version" has exactly 1 "actual" and exactly 1 "pred" value, then df['actual'].eq(1)
will be True for exactly one value per "version"; same for df['pred'].eq(1)
. Since the data is sorted by "version", the versions match.
df['timestamp'] = pd.to_datetime(df['timestamp'])
out = ((df.loc[df['actual'].eq(1), 'timestamp'].to_numpy() -
df.loc[df['pred'].eq(1), 'timestamp'].to_numpy())
.astype('timedelta64[ms]').sum())
If not all versions have a 1 in either of "actual" or "pred", we could still filter the "actual" and "pred" values; then merge on "version" (so that each "version" will have an "actual" and a "pred" value); then find difference and sum
:
df['timestamp'] = pd.to_datetime(df['timestamp'])
merged = (df.loc[df['actual'].eq(1), ['version', 'timestamp']]
.merge(df.loc[df['pred'].eq(1), ['version', 'timestamp']],
on='version', suffixes=('_actual', '_pred')))
out = (merged['timestamp_actual'] - merged['timestamp_pred']).astype('timedelta64[ms]').sum()
Output:
1000