Query to find and subtract two timestamps associated with the same identifier-CodePudding

I'm very new to BigQuery and not terribly familiar with SQL. I have a table of data that looks like this, where MyDate is a Timestamp object:

Row	MyDate	StateTransition	MyIdentifier
1	2022-09-23 00:08:00 UTC	Start	6371
2	2022-10-10 01:17:14 UTC	Finished	6371
3	2022-09-26 04:51:40 UTC	Start	7768
4	2022-10-05 03:44:32 UTC	Finished	7768

etc.

My query looks something like

SELECT *
FROM <my-data-source>
WHERE (StateTransition="Start" OR StateTransition="Finished")
ORDER BY MyIdentifier, MyDate

What I'm trying to do is calculate the elapsed time (in days) between the Start and Finished timestamps associated with each MyIdentifier, and to have that displayed in another column. It could look like:

Row	MyDate	StateTransition	MyIdentifier	ElapsedTime
1	2022-09-23 00:08:00 UTC	Start	6371
2	2022-10-10 01:17:14 UTC	Finished	6371	0.33
3	2022-09-26 04:51:40 UTC	Start	7768
4	2022-10-05 03:44:32 UTC	Finished	7768	0.04

Alternatively, it could even be flattened a little to something like:

Row	StartTransition	FinishedTransition	MyIdentifier	ElapsedTime
1	2022-09-23 00:08:00 UTC	2022-10-10 01:17:14 UTC	6371	0.33
2	2022-09-26 04:51:40 UTC	2022-10-05 03:44:32 UTC	7768	0.04

I've tried looking through the BigQuery docs and Stack Overflow but haven't found anything that addresses this use case of selecting items from multiple rows with a common identifier and then performing an operation on them. It seems like subtracting the two timestamps would be done with the

But for the intermediate result, we need a window function.

SELECT *,
       IF(
         StateTransition = 'Finished',
         TIMESTAMP_DIFF(MyDate, FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w, DAY),
         NULL
       ) AS ElapsedTime
  FROM sample_table
WINDOW w AS (PARTITION BY MyIdentifier ORDER BY MyDate);

and if you want flattend result from the above result (using a window function), the query will looks like below which shows same result as the first query using an aggregation.

SELECT MyIdentifier,
       FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w AS StartTransition,
       MyDate AS FinishedTransition,
       IF(
         StateTransition = 'Finished',
         TIMESTAMP_DIFF(MyDate, FIRST_VALUE(IF(StateTransition = 'Start', MyDate, NULL) IGNORE NULLS) OVER w, DAY),
         NULL
       ) AS ElapsedTime
  FROM sample_table
QUALIFY StateTransition = 'Finished'
WINDOW w AS (PARTITION BY MyIdentifier ORDER BY MyDate);

CodePudding user response：

I think that for each MyIdentifier you should have only one start and one finish, so you can simply split and join:

;WITH 
ts AS ( SELECT * FROM <my-data-source> WHERE StateTransition = 'Start'),
tf AS ( SELECT * FROM <my-data-source> WHERE StateTransition = 'Finished')
SELECT 
    ts.MyIdentifier, 
    ts.MyDate StartTransition, 
    tf.MyDate FinishedTransition, 
    TIMESTAMP_DIFF(ts.MyDate, tf.MyDate, DAY) ElapsedTime
FROM ts
LEFT JOIN tf on ts.MyIdentifier = tf.MyIdentifier