So I have a dataset of bus times where some of the data was recorded incorrectly and duplicated. The query looks like this
SELECT
trip_id,
expect_arrival_time,
recorded_arrival_time,
DENSE_RANK() OVER(PARTITION BY trip_id ORDER BY recorded_arrival_time) as stop_ranks
FROM
bus_trips
Results look something like this:
trip_id | expect_arrival_time | recorded_arrival_time | stop_ranks |
---|---|---|---|
10001444 | 2021-06-20 22:14:19 UTC | 2021-06-20 22:15:52 UTC | 1 |
10001444 | 2021-06-20 22:17:07 UTC | 2021-06-20 22:19:52 UTC | 2 |
10001444 | 2021-06-20 22:19:00 UTC | 2021-06-20 22:19:52 UTC | 2 |
10001444 | 2021-06-20 22:32:26 UTC | 2021-06-20 22:31:59 UTC | 3 |
What I'd like to be able to do is filter out any time the dense rank has multiples so i'd want a query that returns this
trip_id | expect_arrival_time | recorded_arrival_time | stop_ranks |
---|---|---|---|
10001444 | 2021-06-20 22:14:19 UTC | 2021-06-20 22:15:52 UTC | 1 |
10001444 | 2021-06-20 22:17:07 UTC | 2021-06-20 22:19:52 UTC | 2 |
10001444 | 2021-06-20 22:32:26 UTC | 2021-06-20 22:31:59 UTC | 3 |
Any way I could do this?
CodePudding user response:
Try below
SELECT
trip_id,
MIN(expect_arrival_time) as expect_arrival_time,
recorded_arrival_time,
DENSE_RANK() OVER(PARTITION BY trip_id ORDER BY recorded_arrival_time) as stop_ranks
FROM bus_trips
GROUP BY trip_id, recorded_arrival_time
with output
CodePudding user response:
We can try to use ROW_NUMBER
window function twice to make it.
SELECT trip_id,
expect_arrival_time,
recorded_arrival_time,
ROW_NUMBER() OVER(PARTITION BY trip_id ORDER BY recorded_arrival_time) stop_ranks
FROM (
SELECT
trip_id,
expect_arrival_time,
recorded_arrival_time,
ROW_NUMBER() OVER(PARTITION BY trip_id ORDER BY expect_arrival_time) as rn
FROM
bus_trips
) t1
WHERE rn = 1