Home > Software design >  Select and join rows when column value changes per group
Select and join rows when column value changes per group

Time:07-16

Assessments table:

 --------------- ----------- --------------------- 
| assessment_id | device_id |     created_at      |
 --------------- ----------- --------------------- 
|             1 |       1   | 2022-07-15 20:03:03 |
|             2 |       2   | 2022-07-15 21:03:03 |
|             3 |       1   | 2022-07-15 22:03:03 |
|             4 |       2   | 2022-07-15 23:03:03 |
|             5 |       2   | 2022-07-15 23:03:03 |
 --------------- ----------- --------------------- 

Results table:

 --------------- --------- -------- 
| assessment_id | test    | result |
 --------------- --------- -------- 
|             1 |       A | PASS   |
|             2 |       B | FAIL   |
|             3 |       A | FAIL   |
|             4 |       B | PASS   |
|             5 |       B | PASS   |
 --------------- --------- -------- 

Objective

I would like to return a row for each time the result of a test_id changes.

For example, Device 1 has Assessments 1 and 3. Test A in Assessment 1 was PASS, and Test A in Assessment 3 was FAIL, so we want to return this change as a row.

Device 2 has Assessments 2, 4, and 5. There was a test result change in Assessments 2 and 4 (Test B changed from FAIL to PASS), we want to return a row for this.

We do not want to return a row for Assessment 5 because Test B was PASS, and it was also PASS in Assessment 4. No change.

The resulting table would look like this:


 ----------- --------- ------------------------ ---------------- ---------------------- -------------------- ------------ ---------------------- 
| device_id | test_id | previous_assessment_id | previous_value | previous_value_date  | next_assessment_id | next_value |   next_value_date    |
 ----------- --------- ------------------------ ---------------- ---------------------- -------------------- ------------ ---------------------- 
|         1 | A       |                      1 | PASS           | 15/07/2022  20:03:03 |                  3 | FAIL       | 15/07/2022  22:03:03 |
|         2 | B       |                      2 | FAIL           | 15/07/2022  21:03:03 |                  4 | PASS       | 15/07/2022  23:03:03 |
 ----------- --------- ------------------------ ---------------- ---------------------- -------------------- ------------ ---------------------- 

I've tried playing around with a couple of queries found here on SO, but they either took a long time and returned the wrong data, or didn't work at all. I don't think this is a duplicate because I'm using multiple tables, and every other question I've seen refers to a single table.

I've also looked at this SO question but could not get the helpful answer to apply to my situation.

I'm having some weird issue getting SQL Fiddle to work, but here is the test schema I've been tinkering with:

CREATE TABLE `assessments` (
  `id` int,
  `device_id` int,
  `created_at` datetime
);

INSERT INTO `so_assessments` (`id`, `device_id`, `created_at`) VALUES (1, 1, '2022-07-09 22:56:00');
INSERT INTO `so_assessments` (`id`, `device_id`, `created_at`) VALUES (2, 2, '2022-07-10 22:56:06');
INSERT INTO `so_assessments` (`id`, `device_id`, `created_at`) VALUES (3, 1, '2022-07-11 22:56:11');
INSERT INTO `so_assessments` (`id`, `device_id`, `created_at`) VALUES (4, 2, '2022-07-12 22:56:17');
INSERT INTO `so_assessments` (`id`, `device_id`, `created_at`) VALUES (5, 2, '2022-07-13 22:56:24');

CREATE TABLE `results` (
  `assessment_id` int,
  `test` enum('A','B'),
  `result` enum('PASS','FAIL')
);

INSERT INTO `results` (`assessment_id`, `test`, `result`) VALUES (1, 'A', 'PASS');
INSERT INTO `results` (`assessment_id`, `test`, `result`) VALUES (2, 'B', 'FAIL');
INSERT INTO `results` (`assessment_id`, `test`, `result`) VALUES (3, 'A', 'FAIL');
INSERT INTO `results` (`assessment_id`, `test`, `result`) VALUES (4, 'B', 'PASS');
INSERT INTO `results` (`assessment_id`, `test`, `result`) VALUES (5, 'B', 'PASS');

CodePudding user response:

If you are using MySQL 8, Window functions can help. https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html

You can partition your results by device and test, and add a column that is the previous value of the result, then use the last row where the result differs from the previous value.

The following query creates a new column in your results with previous_value

SELECT 
    assessment_id,
    device_id,
    test,
    result,
    LAG (result) over w as `previous_value`,
    LAG (assessment_id) over w as `previous_assessment_id`
FROM assessments join results using(assessment_id)
WINDOW w AS (PARTITION BY device_id, test ORDER BY assessment_id)

Yields the result:

 --------------- ----------- ------ -------- ---------------- ------------------------ 
| assessment_id | device_id | test | result | previous_value | previous_assessment_id |
 --------------- ----------- ------ -------- ---------------- ------------------------ 
|             1 |         1 | A    | PASS   | NULL           |                   NULL |
|             3 |         1 | A    | FAIL   | PASS           |                      1 |
|             2 |         2 | B    | FAIL   | NULL           |                   NULL |
|             4 |         2 | B    | PASS   | FAIL           |                      2 |
|             5 |         2 | B    | PASS   | PASS           |                      4 |
 --------------- ----------- ------ -------- ---------------- ------------------------ 

Which is a big part of the battle. Now we need to take that result and find the row for each device/test pair with the highest assessment_id, where result != previous_value.

The window is calculated after GROUP BY, ORDER BY, and even HAVING, so there's not much more that can be done in that query (that I have thought of) to narrow it down to the only the most recent entries for each device/test pair. So the above will have to be a subquery to get the final result.

Note: I am going to assume that if the result never changes, you want to show the first time that result was recorded. In other words, you want to count results with previous_value = NULL as a transition.

Here's a query that lists all the times the test result from a device/test pair changes:

SELECT * FROM 
(SELECT 
    assessment_id,
    device_id,
    test,
    result,
    LAG (result) over w as `previous_value`
FROM assessments join results using(assessment_id)
WINDOW w AS (PARTITION BY `device_id`, `test` ORDER BY `assessment_id`)
) AS t
WHERE result != `previous_value` OR `previous_value` IS NULL

gets the result (I left out previous_assesssment_id and the others for space):

 --------------- ----------- ------ -------- ---------------- 
| assessment_id | device_id | test | result | previous_value |
 --------------- ----------- ------ -------- ---------------- 
|             1 |         1 | A    | PASS   | NULL           |
|             3 |         1 | A    | FAIL   | PASS           |
|             2 |         2 | B    | FAIL   | NULL           |
|             4 |         2 | B    | PASS   | FAIL           |
 --------------- ----------- ------ -------- ---------------- 

EDIT

That's the answer to the question. If the first time the value is set is not of interest, just delete the OR part of the WHERE clause. There rest of this answer is because I convinced myself the problem was to get the MOST RECENT time the value flipped. I'm leaving it here, but only for interest.

Carrying On

This is all the times the outcome was different than previous, plus the first time a result was recorded. Almost there.

It would be tempting at this point to add another window in the outer query to aggregate the rows from above and identify the correct rows. But at least in MySQL 8, nested windows are not supported.

But given that result, we can create a query using MAX() and GROUP BY that gives the assessment_ids of all the rows we ultimately want:

SELECT MAX(assessment_id)
FROM (
    SELECT 
       assessment_id,
       device_id,
       test,
       result,
       LAG (result) over w as `previous_value`,
       LAG (assessment_id) over w as `previous_assessment_id`
    FROM assessments join results using(assessment_id)
    WINDOW w AS (PARTITION BY device_id, test ORDER BY assessment_id)
) AS t

where result != previous_value OR previous_value IS NULL
GROUP BY device_id, test

Which yields:

 -------------------- 
| MAX(assessment_id) |
 -------------------- 
|                  3 |
|                  4 |
 -------------------- 

Now we know exactly which rows we need; but we built all that data about the previous values, and now we need a way to join the result of that query with the result of the subquery.

Happily, MySQL 8 has a way to stash a query and use it multiple times, called Common Table Expressions, that use the WITH clause docs here. So we can create the table with all the fun data, then use it as a subquery to get the id's we ultimately want, and then join that right back with the results we just created:

WITH
  transitions AS (SELECT 
    assessment_id,
    device_id,
    test,
    result,
    LAG (result) over w as `previous_value`,
    LAG (assessment_id) over w as `previous_assessment_id`
FROM assessments join results using(assessment_id)
WINDOW w AS (PARTITION BY device_id, test ORDER BY assessment_id)
)

SELECT transitions.*
FROM transitions
JOIN (
    SELECT MAX(assessment_id) as assessment_id
    FROM transitions
    WHERE result != previous_value OR previous_value IS NULL
    GROUP BY device_id, test
) AS t2 using (assessment_id)

Giving us the final answer (with the other columns you can fill in):

 --------------- ----------- ------ -------- ---------------- ------------------------ 
| assessment_id | device_id | test | result | previous_value | previous_assessment_id |
 --------------- ----------- ------ -------- ---------------- ------------------------ 
|             3 |         1 | A    | FAIL   | PASS           |                      1 |
|             4 |         2 | B    | PASS   | FAIL           |                      2 |
 --------------- ----------- ------ -------- ---------------- ------------------------ 

The first part creates a data set that includes all the information about what came before each test. Then we write a query that gets the id's of the interesting rows in that query, then we join back to the original data set to fill in all the columns.

CodePudding user response:

You can use windows functions to peek at values of previous or successive rows according to an ordering and grouping.

For example:

select *
from (
  select
    a.device_id,
    r.test,
    a.id as prev_assessment_id,
    r.result as prev_result,
    a.created_at as prev_value_date,
    lead(a.id) over(partition by a.device_id
      order by a.created_at) as next_assessment_id,
    lead(r.result) over(partition by a.device_id
      order by a.created_at) as next_result,
    lead(a.created_at) over(partition by a.device_id
      order by a.created_at) as next_value_date
  from assessments a
  join results r on r.assessment_id = a.id
) x
where prev_result <> next_result

Result:

 device_id  test  prev_assessment_id  prev_result  prev_value_date      next_assessment_id  next_result  next_value_date     
 ---------- ----- ------------------- ------------ -------------------- ------------------- ------------ ------------------- 
 1          A     1                   PASS         2022-07-09 22:56:00  3                   FAIL         2022-07-11 22:56:11 
 2          B     2                   FAIL         2022-07-10 22:56:06  4                   PASS         2022-07-12 22:56:17 

See running example at db<>fiddle.

Note: Your setup queries (that I used) include different data, compared to your expected result.

  • Related