Home > Software design >  get first occurrence of last different row in a group of rows in MySQL table
get first occurrence of last different row in a group of rows in MySQL table

Time:07-20

environment: Server version: 10.7.3-MariaDB-log

tables:

user location history:

CREATE TABLE `location_history` (
  `id` int(10) UNSIGNED NOT NULL,
  `userId` int(10) UNSIGNED DEFAULT NULL,
  `latitude` double(10,8) DEFAULT NULL,
  `longitude` double(11,8) DEFAULT NULL,
  `createdAt` timestamp NOT NULL DEFAULT current_timestamp()
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;

polygons(areas) points:

CREATE TABLE `location_area_points` (
  `id` int(10) UNSIGNED NOT NULL,
  `location_area_id` int(10) UNSIGNED DEFAULT NULL,
  `area_group_id` int(10) UNSIGNED DEFAULT NULL,
  `latitude` double(10,8) DEFAULT NULL,
  `longitude` double(11,8) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;

what I am trying to achieve: find out how long an user has been inside an area,

for example find out when exactly did the userId 1 last enter into area_group_id 24,

now, what we managed to do so far: find in which area, each point is using the following query:

SELECT
    location_history.id,
    location_history.userId,
    location_history.createdAt,
    s.location_area_id
FROM
    location_history
JOIN(
    SELECT
        location_area_points.location_area_id,
        ST_PolygonFromText(
            CONCAT(
                "POLYGON((",
                GROUP_CONCAT(
                    CONCAT(
                        location_area_points.latitude,
                        ' ',
                        location_area_points.longitude
                    ) SEPARATOR ', '
                ),
                "))"
            )
        ) AS polygon
    FROM
        location_area_points
    GROUP BY
        location_area_points.location_area_id
) s
ON
    ST_CONTAINS(
        s.polygon,
        POINT(
            location_history.latitude,
            location_history.longitude
        )
    )
ORDER BY
    createdAt
DESC
    

we get the following as an example with users 1 and 6, :

id  userId  createdAt       location_area_id    
11765   1   2022-07-18 17:03:23 24  
11764   1   2022-07-18 17:03:07 24  
11763   1   2022-07-18 17:02:25 24  
11762   1   2022-07-18 17:02:16 24  
11761   1   2022-07-18 17:01:24 24  
11760   1   2022-07-18 17:00:32 24  
11759   1   2022-07-18 16:59:41 24  
11758   1   2022-07-18 16:59:40 24  <----- include in the results
11757   1   2022-07-18 16:58:49 2   
11756   1   2022-07-18 16:58:04 2   
11755   1   2022-07-18 16:57:06 2   
11754   1   2022-07-18 16:56:23 24  
11752   1   2022-07-18 16:56:14 24  
11753   1   2022-07-18 16:56:14 24  
11751   1   2022-07-18 16:54:31 24  
11750   1   2022-07-18 16:54:30 24  
11749   6   2022-07-18 16:53:39 5   
11748   6   2022-07-18 16:52:47 5   
11747   6   2022-07-18 16:51:56 5   <----- include in the results
11746   6   2022-07-18 16:51:55 24  
11744   6   2022-07-18 16:51:04 24  
11745   1   2022-07-18 16:51:04 24  
11743   1   2022-07-18 16:50:13 24  
11740   1   2022-07-18 16:49:20 24  
11738   1   2022-07-18 16:48:29 24  

now I would like to run additional query on the result above, to find out the first occurrence of the last group please see the code above "include in the results"

so the final result should be :

 id userId  createdAt       location_area_id    
11758   1   2022-07-18 16:59:40 24
11747   6   2022-07-18 16:51:56 5

I apologize if my question is not structured well as I am not sure how to ask such a complicated question and I am open to advice/modification to the question at hand.

CodePudding user response:

@danblack commented above that this is a window function problem, but there's a catch: MySQL 8 does not support nested window functions.

What window functions do is take the result of your query, partition it by rules you state, then allow you to add extra columns to the result with information based on that grouping.

You provided the table definitions - yay! You did not provide insert statements for the test data - aww. To test my answer I will be creating a table that matches your result set above, then selecting all to perform the window functions.

Window functions allow you to add extra columns in the result, that include information about neighboring rows.

SELECT
    location_history.id,
    location_history.userId,
    location_history.createdAt,
    s.location_area_id,
    LAG(location_area_id) OVER (PARTITION BY userId ORDER BY createdAt) as previous_area
FROM
... (the rest of your query)

This will produce a result set just like the one above, but with a column for where that user was in the previous record.

Window functions apply to the result of a query after the query is essentially complete, so there's not a way to narrow the results except to use them as a subquery. To get only the rows where the location changed:

SELECT * from (
    // insert big query here
) as `transitions`
WHERE where location_area_id != previous_area OR previous_area IS NULL

You may not care about the case where previous_area is null, it is just the first record for each user. But if the user hasn't changed locations since, then that record may be relevant to you.

Now we have a list of every time the location changed.

 ------- -------- --------------------- ------------------ --------------- 
| id    | userId | createdAt           | location_area_id | previous_area |
 ------- -------- --------------------- ------------------ --------------- 
| 11758 |      1 | 2022-07-18 16:59:40 |               24 |             2 |
| 11755 |      1 | 2022-07-18 16:57:06 |                2 |            24 |
| 11747 |      6 | 2022-07-18 16:51:56 |                5 |            24 |
| 11744 |      6 | 2022-07-18 16:51:04 |               24 |          NULL |
| 11738 |      1 | 2022-07-18 16:48:29 |               24 |          NULL |
 ------- -------- --------------------- ------------------ --------------- 

The remaining challenge is to find the LATEST row for each user. It seems like a good time for another window function, but that's not supported in MySQL 8. We can modify the above to get that result:

SELECT MAX(id) as last_transition from (
    // insert big query here
) as t
WHERE where location_area_id != previous_area OR previous_area IS NULL
GROUP BY userId

which yields

 ----------------- 
| last_transition |
 ----------------- 
|           11758 |
|           11747 |
 ----------------- 

A quick check shows that this agrees with the records you indicate in the question.

So now we can take the big query, and join it with this result (that also uses the big query), and have the complete answer:

WITH big_query AS (
SELECT
    location_history.id,
    location_history.userId,
    location_history.createdAt,
    s.location_area_id
    LAG(location_area_id) OVER (PARTITION BY userId ORDER BY createdAt) as previous_area
FROM
    location_history
JOIN(
    SELECT
        location_area_points.location_area_id,
        ST_PolygonFromText(
            CONCAT(
                "POLYGON((",
                GROUP_CONCAT(
                    CONCAT(
                        location_area_points.latitude,
                        ' ',
                        location_area_points.longitude
                    ) SEPARATOR ', '
                ),
                "))"
            )
        ) AS polygon
    FROM
        location_area_points
    GROUP BY
        location_area_points.location_area_id
) s
ON
    ST_CONTAINS(
        s.polygon,
        POINT(
            location_history.latitude,
            location_history.longitude
        )
    )
ORDER BY
    createdAt
DESC
)

SELECT * from big_query
  JOIN (
    SELECT MAX(id) as id FROM big_query
    WHERE location_area_id != previous_area OR previous_area IS NULL 
    GROUP BY userId
  ) as last_transitions using(id)

with the final answer

 ------- -------- --------------------- ------------------ --------------- 
| id    | userId | createdAt           | location_area_id | previous_area |
 ------- -------- --------------------- ------------------ --------------- 
| 11758 |      1 | 2022-07-18 16:59:40 |               24 |             2 |
| 11747 |      6 | 2022-07-18 16:51:56 |                5 |            24 |
 ------- -------- --------------------- ------------------ --------------- 

The WITH statement lets you take a SELECT result and treat it like a table in its own right for the duration of the query. You might think of a better name than "big_query" :).

CodePudding user response:

Thanks to @Jerry https://stackoverflow.com/a/73040422/2294803 with his amazing answer. I used his example and the final query I came up with which works on mariadb 10.7.3

for completness:

SELECT
    *,
    MAX(id) AS last_transition
FROM
    (
    SELECT
        *
    FROM
        (
        SELECT
            location_history.id,
            location_history.userId,
            location_history.createdAt,
            s.location_area_id,
            LAG(location_area_id) OVER(
            PARTITION BY userId
        ORDER BY
            createdAt
        ) AS previous_area
    FROM
        location_history
    JOIN(
        SELECT
            location_area_points.location_area_id,
            ST_PolygonFromText(
                CONCAT(
                    "POLYGON((",
                    GROUP_CONCAT(
                        CONCAT(
                            location_area_points.latitude,
                            ' ',
                            location_area_points.longitude
                        ) SEPARATOR ', '
                    ),
                    "))"
                )
            ) AS POLYGON
        FROM
            location_area_points
        GROUP BY
            location_area_points.location_area_id
    ) s
ON
    ST_CONTAINS(
        s.polygon,
        POINT(
            location_history.latitude,
            location_history.longitude
        )
    )
ORDER BY
    createdAt
DESC
    ) AS `transitions`
WHERE
    location_area_id != previous_area OR previous_area IS NULL
) AS t
WHERE
    location_area_id != previous_area OR previous_area IS NULL
GROUP BY
    userId

gives me the result:

id  userId  createdAt   location_area_id    previous_area   last_transition 
11758   1   2022-07-18 16:59:40 24  2   11758   
5121    3   2022-07-18 00:05:30 2   10  5121    
2364    4   2022-05-03 22:59:48 11  2   2364    
12978   5   2022-07-19 17:12:41 2   10  12978   
1747    12  2022-05-03 12:23:35 2   NULL    1747    
1703    14  2022-05-03 02:49:57 24  NULL    1703    
1734    17  2022-05-03 11:08:43 24  NULL    1734    
2623    24  2022-05-29 11:55:59 2   NULL    2623    
2610    25  2022-05-17 07:39:02 2   NULL    2610    
2620    29  2022-05-29 11:48:04 13  2   2620    
2629    35  2022-05-29 13:40:45 2   NULL    2629    
3215    36  2022-07-17 22:48:32 24  25  3215    
11777   41  2022-07-19 09:47:30 24  NULL    11777   
3252    42  2022-07-17 22:50:09 24  NULL    3252    

which seem exactly what I have been looking for!

Thanks to @Jerry

  • Related