Home > Software design >  Finding the total distance of a grouped trips in SQL
Finding the total distance of a grouped trips in SQL

Time:02-17

I have a data that shows individual vs grouped trips of a taxi service

|----------------------------------------------------------------------------------------------------------------------------------------------|
|  Trip_id  |   Trip_Created_Time    |   start_lat     |   start_lon    |     end_lat    |    end_lon   | trip_updated_time   |    Is_Group    |
|----------------------------------------------------------------------------------------------------------------------------------------------|
|    1      |   2021-07-01 17:29:51  |     81.91892    |  -42.19823     |   81.90281     |   -42.38918  | 2021-07-01 17:35:21 |      YES       |
|    2      |   2021-07-01 17:31:52  |     81.91892    |  -42.46920     |   81.97392     |   -42.37819  | 2021-07-01 17:52:51 |      YES       |
|    3      |   2021-07-02 21:50:51  |     81.91892    |  -42.01936     |   81.18937     |   -42.01967  | 2021-07-02 22:09:09 |      NO        |
|    4      |   2021-07-02 23:31:41  |     81.91892    |  -42.47821     |   81.01792     |   -42.17839  | 2021-07-02 23:41:51 |      NO        |
|    5      |   2021-09-21 20:12:54  |     81.91892    |  -42.47821     |   81.63829     |   -42.67292  | 2021-09-21 20:42:54 |      YES       |
|    6      |   2021-09-21 20:15:21  |     81.91892    |  -42.47821     |   81.62819     |   -42.01927  | 2021-09-21 20:59:21 |      YES       |
|    7      |   2021-09-21 20:17:23  |     81.91892    |  -42.47821     |   81.03926     |   -42.36284  | 2021-09-21 21:02:21 |      YES       |
|    8      |   2021-11-01 02:41:41  |     81.91892    |  -42.47821     |   81.36292     |   -42.47682  | 2021-07-02 23:41:51 |      NO        |
|    9      |   2021-12-21 19:19:41  |     81.91892    |  -42.47821     |   81.23671     |   -42.93628  | 2021-07-02 23:41:51 |      NO        |
|----------------------------------------------------------------------------------------------------------------------------------------------|

Group = Two or more users start at same location but with different destinations

I'm trying to find the distance from starting lat/lon to the ending lat/lon for grouped and non-grouped trips

Here's my try:

select *,
       case when is_group = 'NO'
            then haversine(start_lat, start_lon, end_lat, end_lon)
        when is_group = 'YES'
            then NULL
       end as trip_distance
from my_table

The current output is as follows:

|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  Trip_id  |   Trip_Created_Time    |   start_lat     |   start_lon    |     end_lat    |    end_lon   | trip_updated_time   |    Is_Group    |  trip_distance |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    1      |   2021-07-01 17:29:51  |     81.91892    |  -42.19823     |   81.90281     |   -42.38918  | 2021-07-01 17:35:21 |      YES       |     NULL       |
|    2      |   2021-07-01 17:31:52  |     81.91892    |  -42.46920     |   81.97392     |   -42.37819  | 2021-07-01 17:52:51 |      YES       |     NULL       |
|    3      |   2021-07-02 21:50:51  |     81.91892    |  -42.01936     |   81.18937     |   -42.01967  | 2021-07-02 22:09:09 |      NO        |    1.289247    |
|    4      |   2021-07-02 23:31:41  |     81.91892    |  -42.47821     |   81.01792     |   -42.17839  | 2021-07-02 23:41:51 |      NO        |    0.387922    |
|    5      |   2021-09-21 20:12:54  |     81.91892    |  -42.47821     |   81.63829     |   -42.67292  | 2021-09-21 20:42:54 |      YES       |     NULL       |
|    6      |   2021-09-21 20:15:21  |     81.91892    |  -42.47821     |   81.62819     |   -42.01927  | 2021-09-21 20:59:21 |      YES       |     NULL       |
|    7      |   2021-09-21 20:17:23  |     81.91892    |  -42.47821     |   81.03926     |   -42.36284  | 2021-09-21 21:02:21 |      YES       |     NULL       |
|    8      |   2021-11-01 02:41:41  |     81.91892    |  -42.47821     |   81.36292     |   -42.47682  | 2021-07-02 23:41:51 |      NO        |    3.29181     |
|    9      |   2021-12-21 19:19:41  |     81.91892    |  -42.47821     |   81.23671     |   -42.93628  | 2021-07-02 23:41:51 |      NO        |    0.29822     |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|

How can I calculate the trip_distance if it is in a group i.e. is_group = 'YES'

Edit: The final output distance for a grouped should be the sum of distances of all trips in this group i.e. A->B->C = A B C

|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  Trip_id  |   Trip_Created_Time    |   start_lat     |   start_lon    |     end_lat    |    end_lon   | trip_updated_time   |    Is_Group    |  trip_distance |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    1      |   2021-07-01 17:29:51  |     81.91892    |  -42.19823     |   81.90281     |   -42.38918  | 2021-07-01 17:35:21 |      YES       |    1.28463     | <--- 
|    2      |   2021-07-01 17:31:52  |     81.91892    |  -42.46920     |   81.97392     |   -42.37819  | 2021-07-01 17:52:51 |      YES       |    1.28463     | <--- These two total distances are same since grouped
|    3      |   2021-07-02 21:50:51  |     81.91892    |  -42.01936     |   81.18937     |   -42.01967  | 2021-07-02 22:09:09 |      NO        |    1.289247    |
|    4      |   2021-07-02 23:31:41  |     81.91892    |  -42.47821     |   81.01792     |   -42.17839  | 2021-07-02 23:41:51 |      NO        |    0.387922    |
|    5      |   2021-09-21 20:12:54  |     81.91892    |  -42.47821     |   81.63829     |   -42.67292  | 2021-09-21 20:42:54 |      YES       |    4.38921     | <---
|    6      |   2021-09-21 20:15:21  |     81.91892    |  -42.47821     |   81.62819     |   -42.01927  | 2021-09-21 20:59:21 |      YES       |    4.38921     | <---
|    7      |   2021-09-21 20:17:23  |     81.91892    |  -42.47821     |   81.03926     |   -42.36284  | 2021-09-21 21:02:21 |      YES       |    4.38921     | <--- These three total distances are same since grouped
|    8      |   2021-11-01 02:41:41  |     81.91892    |  -42.47821     |   81.36292     |   -42.47682  | 2021-07-02 23:41:51 |      NO        |    3.29181     |
|    9      |   2021-12-21 19:19:41  |     81.91892    |  -42.47821     |   81.23671     |   -42.93628  | 2021-07-02 23:41:51 |      NO        |    0.29822     |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|

CodePudding user response:

If I am understanding correctly, you would need the end point of the previous stop to then calculate the distance to the next stop. I would try something like this.

select Trip_id,
       Trip_Created_Time,
       trip_updated_time,
       start_lat,
       start_lon,
       lag(end_lat, 1, 0) over (partition by start_lat, date_trunc(hour, Trip_Created_Time)  order by Trip_Created_Time, trip_updated_time) as mid_point_lat,
       lag(end_lon, 1, 0) over (partition by start_lat, date_trunc(hour, Trip_Created_Time)  order by Trip_Created_Time, trip_updated_time) as mid_point_lon,
       end_lat,
       end_lon,
       Is_Group,
       case when is_group = 'NO' or (Is_Group = 'YES' and mid_point_lat = 0)
            then haversine(start_lat, start_lon, end_lat, end_lon)
        when is_group = 'YES' and mid_point_lat != 0
            then haversine(mid_point_lat, mid_point_lon, end_lat, end_lon)
       end as trip_distance
from my_table
order by Trip_id ;

The above query would get you the true distances for each trip. Last step would be to group those and sum the distances

CodePudding user response:

So you example data/SQL does not produce you example output:

WITH fake_data AS (
    SELECT * FROM VALUES
        ( 1, '2021-07-01 17:29:51', 81.91892, -42.19823, 81.90281, -42.38918, '2021-07-01 17:35:21', 'YES', 1),
        ( 2, '2021-07-01 17:31:52', 81.91892, -42.46920, 81.97392, -42.37819, '2021-07-01 17:52:51', 'YES', 1),
        ( 3, '2021-07-02 21:50:51', 81.91892, -42.01936, 81.18937, -42.01967, '2021-07-02 22:09:09', 'NO', null),
        ( 4, '2021-07-02 23:31:41', 81.91892, -42.47821, 81.01792, -42.17839, '2021-07-02 23:41:51', 'NO', null),
        ( 5, '2021-09-21 20:12:54', 81.91892, -42.47821, 81.63829, -42.67292, '2021-09-21 20:42:54', 'YES', 2),
        ( 6, '2021-09-21 20:15:21', 81.91892, -42.47821, 81.62819, -42.01927, '2021-09-21 20:59:21', 'YES', 2),
        ( 7, '2021-09-21 20:17:23', 81.91892, -42.47821, 81.03926, -42.36284, '2021-09-21 21:02:21', 'YES', 2),
        ( 8, '2021-11-01 02:41:41', 81.91892, -42.47821, 81.36292, -42.47682, '2021-07-02 23:41:51', 'NO', null),
        ( 9, '2021-12-21 19:19:41', 81.91892, -42.47821, 81.23671, -42.93628, '2021-07-02 23:41:51', 'NO', null)
        t(trip_id, trip_created_time, start_lat, start_lon, end_lat, end_lon, trip_updated_time, is_group, group_id)
)
select *,
       case when is_group = 'NO'
            then haversine(start_lat, start_lon, end_lat, end_lon)
        when is_group = 'YES'
            then NULL
       end as trip_distance
from fake_data;

creates:

TRIP_ID TRIP_CREATED_TIME START_LAT START_LON END_LAT END_LON TRIP_UPDATED_TIME IS_GROUP GROUP_ID TRIP_DISTANCE
1 2021-07-01 17:29:51 81.91892 -42.19823 81.90281 -42.38918 2021-07-01 17:35:21 YES 1
2 2021-07-01 17:31:52 81.91892 -42.4692 81.97392 -42.37819 2021-07-01 17:52:51 YES 1
3 2021-07-02 21:50:51 81.91892 -42.01936 81.18937 -42.01967 2021-07-02 22:09:09 NO 81.122258891
4 2021-07-02 23:31:41 81.91892 -42.47821 81.01792 -42.17839 2021-07-02 23:41:51 NO 100.308299209
5 2021-09-21 20:12:54 81.91892 -42.47821 81.63829 -42.67292 2021-09-21 20:42:54 YES 2
6 2021-09-21 20:15:21 81.91892 -42.47821 81.62819 -42.01927 2021-09-21 20:59:21 YES 2
7 2021-09-21 20:17:23 81.91892 -42.47821 81.03926 -42.36284 2021-09-21 21:02:21 YES 2
8 2021-11-01 02:41:41 81.91892 -42.47821 81.36292 -42.47682 2021-07-02 23:41:51 NO 61.824383293
9 2021-12-21 19:19:41 81.91892 -42.47821 81.23671 -42.93628 2021-07-02 23:41:51 NO 76.223649989

But if we assume these are valid:

what you describe as wanting ( deconstructed ) seems like (if a grouping id is provided):

SELECT 
    trip_id
    ,trip_created_time
    ,start_lat
    ,start_lon
    ,end_lat
    ,end_lon 
    ,trip_updated_time
    ,is_group
    ,round(haversine(start_lat, start_lon, end_lat, end_lon),3) as dist_km
    ,sum(dist_km) over (partition by group_id) as group_sum_km
    ,iff(is_group='YES', group_sum_km, dist_km) as result
FROM fake_data
ORDER BY 1
;

gives:

TRIP_ID TRIP_CREATED_TIME START_LAT START_LON END_LAT END_LON TRIP_UPDATED_TIME IS_GROUP DIST_KM GROUP_SUM_KM RESULT
1 2021-07-01 17:29:51 81.91892 -42.19823 81.90281 -42.38918 2021-07-01 17:35:21 YES 3.484 9.762 9.762
2 2021-07-01 17:31:52 81.91892 -42.4692 81.97392 -42.37819 2021-07-01 17:52:51 YES 6.278 9.762 9.762
3 2021-07-02 21:50:51 81.91892 -42.01936 81.18937 -42.01967 2021-07-02 22:09:09 NO 81.122 319.478 81.122
4 2021-07-02 23:31:41 81.91892 -42.47821 81.01792 -42.17839 2021-07-02 23:41:51 NO 100.308 319.478 100.308
5 2021-09-21 20:12:54 81.91892 -42.47821 81.63829 -42.67292 2021-09-21 20:42:54 YES 31.358 162.332 162.332
6 2021-09-21 20:15:21 81.91892 -42.47821 81.62819 -42.01927 2021-09-21 20:59:21 YES 33.142 162.332 162.332
7 2021-09-21 20:17:23 81.91892 -42.47821 81.03926 -42.36284 2021-09-21 21:02:21 YES 97.832 162.332 162.332
8 2021-11-01 02:41:41 81.91892 -42.47821 81.36292 -42.47682 2021-07-02 23:41:51 NO 61.824 319.478 61.824
9 2021-12-21 19:19:41 81.91892 -42.47821 81.23671 -42.93628 2021-07-02 23:41:51 NO 76.224 319.478 76.224

which can be compacted to:

SELECT 
    trip_id
    ,trip_created_time
    ,start_lat
    ,start_lon
    ,end_lat
    ,end_lon 
    ,trip_updated_time
    ,is_group
    , iff(is_group='YES'
        ,sum(haversine(start_lat, start_lon, end_lat, end_lon)) over (partition by group_id)
        ,haversine(start_lat, start_lon, end_lat, end_lon)
      ) as result
FROM fake_data
ORDER BY 1
;
  • Related