I have a data that shows individual vs grouped trips of a taxi service
|----------------------------------------------------------------------------------------------------------------------------------------------|
| Trip_id | Trip_Created_Time | start_lat | start_lon | end_lat | end_lon | trip_updated_time | Is_Group |
|----------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | 2021-07-01 17:29:51 | 81.91892 | -42.19823 | 81.90281 | -42.38918 | 2021-07-01 17:35:21 | YES |
| 2 | 2021-07-01 17:31:52 | 81.91892 | -42.46920 | 81.97392 | -42.37819 | 2021-07-01 17:52:51 | YES |
| 3 | 2021-07-02 21:50:51 | 81.91892 | -42.01936 | 81.18937 | -42.01967 | 2021-07-02 22:09:09 | NO |
| 4 | 2021-07-02 23:31:41 | 81.91892 | -42.47821 | 81.01792 | -42.17839 | 2021-07-02 23:41:51 | NO |
| 5 | 2021-09-21 20:12:54 | 81.91892 | -42.47821 | 81.63829 | -42.67292 | 2021-09-21 20:42:54 | YES |
| 6 | 2021-09-21 20:15:21 | 81.91892 | -42.47821 | 81.62819 | -42.01927 | 2021-09-21 20:59:21 | YES |
| 7 | 2021-09-21 20:17:23 | 81.91892 | -42.47821 | 81.03926 | -42.36284 | 2021-09-21 21:02:21 | YES |
| 8 | 2021-11-01 02:41:41 | 81.91892 | -42.47821 | 81.36292 | -42.47682 | 2021-07-02 23:41:51 | NO |
| 9 | 2021-12-21 19:19:41 | 81.91892 | -42.47821 | 81.23671 | -42.93628 | 2021-07-02 23:41:51 | NO |
|----------------------------------------------------------------------------------------------------------------------------------------------|
Group = Two or more users start at same location but with different destinations
I'm trying to find the distance from starting lat/lon to the ending lat/lon for grouped and non-grouped trips
Here's my try:
select *,
case when is_group = 'NO'
then haversine(start_lat, start_lon, end_lat, end_lon)
when is_group = 'YES'
then NULL
end as trip_distance
from my_table
The current output is as follows:
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Trip_id | Trip_Created_Time | start_lat | start_lon | end_lat | end_lon | trip_updated_time | Is_Group | trip_distance |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | 2021-07-01 17:29:51 | 81.91892 | -42.19823 | 81.90281 | -42.38918 | 2021-07-01 17:35:21 | YES | NULL |
| 2 | 2021-07-01 17:31:52 | 81.91892 | -42.46920 | 81.97392 | -42.37819 | 2021-07-01 17:52:51 | YES | NULL |
| 3 | 2021-07-02 21:50:51 | 81.91892 | -42.01936 | 81.18937 | -42.01967 | 2021-07-02 22:09:09 | NO | 1.289247 |
| 4 | 2021-07-02 23:31:41 | 81.91892 | -42.47821 | 81.01792 | -42.17839 | 2021-07-02 23:41:51 | NO | 0.387922 |
| 5 | 2021-09-21 20:12:54 | 81.91892 | -42.47821 | 81.63829 | -42.67292 | 2021-09-21 20:42:54 | YES | NULL |
| 6 | 2021-09-21 20:15:21 | 81.91892 | -42.47821 | 81.62819 | -42.01927 | 2021-09-21 20:59:21 | YES | NULL |
| 7 | 2021-09-21 20:17:23 | 81.91892 | -42.47821 | 81.03926 | -42.36284 | 2021-09-21 21:02:21 | YES | NULL |
| 8 | 2021-11-01 02:41:41 | 81.91892 | -42.47821 | 81.36292 | -42.47682 | 2021-07-02 23:41:51 | NO | 3.29181 |
| 9 | 2021-12-21 19:19:41 | 81.91892 | -42.47821 | 81.23671 | -42.93628 | 2021-07-02 23:41:51 | NO | 0.29822 |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
How can I calculate the trip_distance if it is in a group i.e. is_group = 'YES'
Edit: The final output distance for a grouped should be the sum of distances of all trips in this group i.e. A->B->C = A B C
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Trip_id | Trip_Created_Time | start_lat | start_lon | end_lat | end_lon | trip_updated_time | Is_Group | trip_distance |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | 2021-07-01 17:29:51 | 81.91892 | -42.19823 | 81.90281 | -42.38918 | 2021-07-01 17:35:21 | YES | 1.28463 | <---
| 2 | 2021-07-01 17:31:52 | 81.91892 | -42.46920 | 81.97392 | -42.37819 | 2021-07-01 17:52:51 | YES | 1.28463 | <--- These two total distances are same since grouped
| 3 | 2021-07-02 21:50:51 | 81.91892 | -42.01936 | 81.18937 | -42.01967 | 2021-07-02 22:09:09 | NO | 1.289247 |
| 4 | 2021-07-02 23:31:41 | 81.91892 | -42.47821 | 81.01792 | -42.17839 | 2021-07-02 23:41:51 | NO | 0.387922 |
| 5 | 2021-09-21 20:12:54 | 81.91892 | -42.47821 | 81.63829 | -42.67292 | 2021-09-21 20:42:54 | YES | 4.38921 | <---
| 6 | 2021-09-21 20:15:21 | 81.91892 | -42.47821 | 81.62819 | -42.01927 | 2021-09-21 20:59:21 | YES | 4.38921 | <---
| 7 | 2021-09-21 20:17:23 | 81.91892 | -42.47821 | 81.03926 | -42.36284 | 2021-09-21 21:02:21 | YES | 4.38921 | <--- These three total distances are same since grouped
| 8 | 2021-11-01 02:41:41 | 81.91892 | -42.47821 | 81.36292 | -42.47682 | 2021-07-02 23:41:51 | NO | 3.29181 |
| 9 | 2021-12-21 19:19:41 | 81.91892 | -42.47821 | 81.23671 | -42.93628 | 2021-07-02 23:41:51 | NO | 0.29822 |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
CodePudding user response:
If I am understanding correctly, you would need the end point of the previous stop to then calculate the distance to the next stop. I would try something like this.
select Trip_id,
Trip_Created_Time,
trip_updated_time,
start_lat,
start_lon,
lag(end_lat, 1, 0) over (partition by start_lat, date_trunc(hour, Trip_Created_Time) order by Trip_Created_Time, trip_updated_time) as mid_point_lat,
lag(end_lon, 1, 0) over (partition by start_lat, date_trunc(hour, Trip_Created_Time) order by Trip_Created_Time, trip_updated_time) as mid_point_lon,
end_lat,
end_lon,
Is_Group,
case when is_group = 'NO' or (Is_Group = 'YES' and mid_point_lat = 0)
then haversine(start_lat, start_lon, end_lat, end_lon)
when is_group = 'YES' and mid_point_lat != 0
then haversine(mid_point_lat, mid_point_lon, end_lat, end_lon)
end as trip_distance
from my_table
order by Trip_id ;
The above query would get you the true distances for each trip. Last step would be to group those and sum the distances
CodePudding user response:
So you example data/SQL does not produce you example output:
WITH fake_data AS (
SELECT * FROM VALUES
( 1, '2021-07-01 17:29:51', 81.91892, -42.19823, 81.90281, -42.38918, '2021-07-01 17:35:21', 'YES', 1),
( 2, '2021-07-01 17:31:52', 81.91892, -42.46920, 81.97392, -42.37819, '2021-07-01 17:52:51', 'YES', 1),
( 3, '2021-07-02 21:50:51', 81.91892, -42.01936, 81.18937, -42.01967, '2021-07-02 22:09:09', 'NO', null),
( 4, '2021-07-02 23:31:41', 81.91892, -42.47821, 81.01792, -42.17839, '2021-07-02 23:41:51', 'NO', null),
( 5, '2021-09-21 20:12:54', 81.91892, -42.47821, 81.63829, -42.67292, '2021-09-21 20:42:54', 'YES', 2),
( 6, '2021-09-21 20:15:21', 81.91892, -42.47821, 81.62819, -42.01927, '2021-09-21 20:59:21', 'YES', 2),
( 7, '2021-09-21 20:17:23', 81.91892, -42.47821, 81.03926, -42.36284, '2021-09-21 21:02:21', 'YES', 2),
( 8, '2021-11-01 02:41:41', 81.91892, -42.47821, 81.36292, -42.47682, '2021-07-02 23:41:51', 'NO', null),
( 9, '2021-12-21 19:19:41', 81.91892, -42.47821, 81.23671, -42.93628, '2021-07-02 23:41:51', 'NO', null)
t(trip_id, trip_created_time, start_lat, start_lon, end_lat, end_lon, trip_updated_time, is_group, group_id)
)
select *,
case when is_group = 'NO'
then haversine(start_lat, start_lon, end_lat, end_lon)
when is_group = 'YES'
then NULL
end as trip_distance
from fake_data;
creates:
TRIP_ID | TRIP_CREATED_TIME | START_LAT | START_LON | END_LAT | END_LON | TRIP_UPDATED_TIME | IS_GROUP | GROUP_ID | TRIP_DISTANCE |
---|---|---|---|---|---|---|---|---|---|
1 | 2021-07-01 17:29:51 | 81.91892 | -42.19823 | 81.90281 | -42.38918 | 2021-07-01 17:35:21 | YES | 1 | |
2 | 2021-07-01 17:31:52 | 81.91892 | -42.4692 | 81.97392 | -42.37819 | 2021-07-01 17:52:51 | YES | 1 | |
3 | 2021-07-02 21:50:51 | 81.91892 | -42.01936 | 81.18937 | -42.01967 | 2021-07-02 22:09:09 | NO | 81.122258891 | |
4 | 2021-07-02 23:31:41 | 81.91892 | -42.47821 | 81.01792 | -42.17839 | 2021-07-02 23:41:51 | NO | 100.308299209 | |
5 | 2021-09-21 20:12:54 | 81.91892 | -42.47821 | 81.63829 | -42.67292 | 2021-09-21 20:42:54 | YES | 2 | |
6 | 2021-09-21 20:15:21 | 81.91892 | -42.47821 | 81.62819 | -42.01927 | 2021-09-21 20:59:21 | YES | 2 | |
7 | 2021-09-21 20:17:23 | 81.91892 | -42.47821 | 81.03926 | -42.36284 | 2021-09-21 21:02:21 | YES | 2 | |
8 | 2021-11-01 02:41:41 | 81.91892 | -42.47821 | 81.36292 | -42.47682 | 2021-07-02 23:41:51 | NO | 61.824383293 | |
9 | 2021-12-21 19:19:41 | 81.91892 | -42.47821 | 81.23671 | -42.93628 | 2021-07-02 23:41:51 | NO | 76.223649989 |
But if we assume these are valid:
what you describe as wanting ( deconstructed ) seems like (if a grouping id is provided):
SELECT
trip_id
,trip_created_time
,start_lat
,start_lon
,end_lat
,end_lon
,trip_updated_time
,is_group
,round(haversine(start_lat, start_lon, end_lat, end_lon),3) as dist_km
,sum(dist_km) over (partition by group_id) as group_sum_km
,iff(is_group='YES', group_sum_km, dist_km) as result
FROM fake_data
ORDER BY 1
;
gives:
TRIP_ID | TRIP_CREATED_TIME | START_LAT | START_LON | END_LAT | END_LON | TRIP_UPDATED_TIME | IS_GROUP | DIST_KM | GROUP_SUM_KM | RESULT |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2021-07-01 17:29:51 | 81.91892 | -42.19823 | 81.90281 | -42.38918 | 2021-07-01 17:35:21 | YES | 3.484 | 9.762 | 9.762 |
2 | 2021-07-01 17:31:52 | 81.91892 | -42.4692 | 81.97392 | -42.37819 | 2021-07-01 17:52:51 | YES | 6.278 | 9.762 | 9.762 |
3 | 2021-07-02 21:50:51 | 81.91892 | -42.01936 | 81.18937 | -42.01967 | 2021-07-02 22:09:09 | NO | 81.122 | 319.478 | 81.122 |
4 | 2021-07-02 23:31:41 | 81.91892 | -42.47821 | 81.01792 | -42.17839 | 2021-07-02 23:41:51 | NO | 100.308 | 319.478 | 100.308 |
5 | 2021-09-21 20:12:54 | 81.91892 | -42.47821 | 81.63829 | -42.67292 | 2021-09-21 20:42:54 | YES | 31.358 | 162.332 | 162.332 |
6 | 2021-09-21 20:15:21 | 81.91892 | -42.47821 | 81.62819 | -42.01927 | 2021-09-21 20:59:21 | YES | 33.142 | 162.332 | 162.332 |
7 | 2021-09-21 20:17:23 | 81.91892 | -42.47821 | 81.03926 | -42.36284 | 2021-09-21 21:02:21 | YES | 97.832 | 162.332 | 162.332 |
8 | 2021-11-01 02:41:41 | 81.91892 | -42.47821 | 81.36292 | -42.47682 | 2021-07-02 23:41:51 | NO | 61.824 | 319.478 | 61.824 |
9 | 2021-12-21 19:19:41 | 81.91892 | -42.47821 | 81.23671 | -42.93628 | 2021-07-02 23:41:51 | NO | 76.224 | 319.478 | 76.224 |
which can be compacted to:
SELECT
trip_id
,trip_created_time
,start_lat
,start_lon
,end_lat
,end_lon
,trip_updated_time
,is_group
, iff(is_group='YES'
,sum(haversine(start_lat, start_lon, end_lat, end_lon)) over (partition by group_id)
,haversine(start_lat, start_lon, end_lat, end_lon)
) as result
FROM fake_data
ORDER BY 1
;