Home > Net >  How to first join tables and then filter rows to keep the greatest per group
How to first join tables and then filter rows to keep the greatest per group

Time:11-28

I'm trying to write a SELECT statement that first joins two tables and then filters rows, by keeping the max value per group.

Example

The following two tables describe visitors in a zoo.

  • visitors table includes all unique people ever visited the zoo; one row per person.
  • activity_log table describes what each visitor did at the zoo over the course of their visit; one row per activity.

Generate data (reproducible)

The following SQL code is compatible with MySQL:

-- visitors
CREATE TABLE visitors(
                      visitor_id       INTEGER  NOT NULL PRIMARY KEY, 
                      country_of_birth VARCHAR(7) NOT NULL
                      );
                      
INSERT INTO visitors(visitor_id, country_of_birth) VALUES
                    (1,          'Bolivia'),
                    (2,          'UK'),
                    (3,          'UK'),
                    (4,          'Bolivia'),
                    (5,          'UK'),
                    (6,          'UK'),
                    (7,          'France'),
                    (8,          'USA'),
                    (9,          'UK'),
                    (10,         'France');

-- activity_log
CREATE TABLE activity_log(
                          visitor_id         INTEGER  NOT NULL,
                          FOREIGN KEY (visitor_id) REFERENCES visitors(visitor_id),
                          activity_time      DATETIME  NOT NULL,
                          activity_name      VARCHAR(14) NOT NULL,
                          what_was_purchased VARCHAR(8)
                          );
                       
INSERT INTO activity_log(visitor_id, activity_time,          activity_name,     what_was_purchased) VALUES
                        (1,          '2020-09-03 11:15:00',  'visit lions',     NULL),
                        (1,          '2020-09-03 10:30:00',  'use restroom',    NULL),
                        (1,          '2020-09-03 10:10:00',  'visit reptiles',  NULL),
                        (1,          '2020-09-03 10:45:00',  'purchase',        'coffee'),
                        (2,          '2021-02-10 15:30:00',  'visit giftshop',  NULL),
                        (2,          '2021-02-10 15:02:00',  'visit zebras',    NULL),
                        (2,          '2021-02-10 15:45:00',  'visit giraffes',  NULL),
                        (3,          '2021-07-07 13:04:00',  'visit reptiles',  NULL),
                        (3,          '2021-07-07 13:50:00',  'visit bears',     NULL),
                        (3,          '2021-07-07 13:40:00',  'purchase',        'icecream'),
                        (3,          '2021-07-07 14:12:00',  'purchase',        'coffee'),
                        (4,          '2021-08-19 11:33:00',  'visit monkeys',   NULL),
                        (4,          '2021-08-19 11:18:00',  'visit lions',     NULL),
                        (4,          '2021-08-19 11:47:00',  'use restroom',    NULL),
                        (5,          '2022-04-12 10:55:00',  'visit zebras',    NULL),
                        (5,          '2022-04-12 11:42:00',  'purchase',        'coffee'),
                        (5,          '2022-04-12 10:45:00',  'purchase',        'hotdog'),
                        (5,          '2022-04-12 11:27:00',  'purchase',        'popcorn'),
                        (6,          '2022-04-12 14:00:00',  'purchase',        'icecream'),
                        (7,          '2022-05-09 12:38:00',  'use restroom',    NULL),
                        (7,          '2022-05-09 12:52:00',  'visit reptiles',  NULL),
                        (7,          '2022-05-09 12:30:00',  'visit zebras',    NULL),
                        (8,          '2022-07-07 15:00:00',  'purchase',        'popcorn'),
                        (8,          '2022-07-07 15:10:00',  'visit birds',     NULL),
                        (9,          '2022-07-11 12:13:00',  'purchase',        'popcorn'),
                        (9,          '2022-07-11 11:23:00',  'purchase',        'coffee'),
                        (9,          '2022-07-11 11:00:00',  'visit lions',     NULL),
                        (9,          '2022-07-11 11:54:00',  'visit monkeys',   NULL),
                        (10,         '2022-08-31 9:30:00',   'use restroom',    NULL);

The query I want

A table of all UK visitors who purchased something, and what that was. In case a person purchased more than one thing, show the last purchased item. Therefore, a table with 2 columns: (1) visitor_id, (2) what_was_purchased.

Desired Output

#>  ------------ -------------------- 
#> | visitor_id | what_was_purchased |
#>  ------------ -------------------- 
#> |          3 | coffee             |
#> |          5 | coffee             |
#> |          6 | icecream           |
#> |          9 | popcorn            |
#>  ------------ -------------------- #> 

My Attempt

I've gone thus far, and even this one doesn't seem OK:

SELECT * 
FROM visitors AS v
LEFT JOIN activity_log AS al ON v.visitor_id = al.visitor_id 
      AND v.country_of_birth = 'UK' 
      AND al.visitor_id IN (
                        SELECT  visitor_id
                        FROM activity_log
                        GROUP BY visitor_id
                        HAVING SUM(CASE WHEN what_was_purchased IS NULL THEN 0 ELSE 1 END)  > 0
                       );

--  ------------ ------------------ ------------ --------------------- ---------------- -------------------- 
-- | visitor_id | country_of_birth | visitor_id | activity_time       | activity_name  | what_was_purchased |
--  ------------ ------------------ ------------ --------------------- ---------------- -------------------- 
-- |          1 | Bolivia          |       NULL | NULL                | NULL           | NULL               |
-- |          2 | UK               |       NULL | NULL                | NULL           | NULL               |
-- |          3 | UK               |          3 | 2021-07-07 13:04:00 | visit reptiles | NULL               |
-- |          3 | UK               |          3 | 2021-07-07 13:50:00 | visit bears    | NULL               |
-- |          3 | UK               |          3 | 2021-07-07 13:40:00 | purchase       | icecream           |
-- |          3 | UK               |          3 | 2021-07-07 14:12:00 | purchase       | coffee             |
-- |          4 | Bolivia          |       NULL | NULL                | NULL           | NULL               |
-- |          5 | UK               |          5 | 2022-04-12 10:55:00 | visit zebras   | NULL               |
-- |          5 | UK               |          5 | 2022-04-12 11:42:00 | purchase       | coffee             |
-- |          5 | UK               |          5 | 2022-04-12 10:45:00 | purchase       | hotdog             |
-- |          5 | UK               |          5 | 2022-04-12 11:27:00 | purchase       | popcorn            |
-- |          6 | UK               |          6 | 2022-04-12 14:00:00 | purchase       | icecream           |
-- |          7 | France           |       NULL | NULL                | NULL           | NULL               |
-- |          8 | USA              |       NULL | NULL                | NULL           | NULL               |
-- |          9 | UK               |          9 | 2022-07-11 12:13:00 | purchase       | popcorn            |
-- |          9 | UK               |          9 | 2022-07-11 11:23:00 | purchase       | coffee             |
-- |          9 | UK               |          9 | 2022-07-11 11:00:00 | visit lions    | NULL               |
-- |          9 | UK               |          9 | 2022-07-11 11:54:00 | visit monkeys  | NULL               |
-- |         10 | France           |       NULL | NULL                | NULL           | NULL               |
--  ------------ ------------------ ------------ --------------------- ---------------- -------------------- 
-- 19 rows in set (0.00 sec)

Explaining my syntax

  • I did

    LEFT JOIN activity_log AS al ON v.visitor_id = al.visitor_id 
          AND v.country_of_birth = 'UK'
    

    based on this answer, to already have only UK rows prior to the join. As you can see, this didn't work out well, as I still have the other countries with NULL. But I guess I can filter those with a WHERE clause. (However, I have no idea why it wasn't removed like in the referenced answer).

  • I did

    AND al.visitor_id IN (
                            SELECT  visitor_id
                            FROM activity_log
                            GROUP BY visitor_id
                            HAVING SUM(CASE WHEN what_was_purchased IS NULL THEN 0 ELSE 1 END)  > 0
                           );
    

    to filter on the people who had at least one purchase, prior to the join. Here, too, visitor with id 2 was is null and supposed to be removed.

Now what?

For the sake of the question, let's ignore the NULL rows and pretend the filtering by "only UK" and "at least one purchase" was successful:

-- pseudo result I manually edited
--  ------------ ------------------ ------------ --------------------- ---------------- -------------------- 
-- | visitor_id | country_of_birth | visitor_id | activity_time       | activity_name  | what_was_purchased |
--  ------------ ------------------ ------------ --------------------- ---------------- -------------------- 
-- |          3 | UK               |          3 | 2021-07-07 13:04:00 | visit reptiles | NULL               |
-- |          3 | UK               |          3 | 2021-07-07 13:50:00 | visit bears    | NULL               |
-- |          3 | UK               |          3 | 2021-07-07 13:40:00 | purchase       | icecream           |
-- |          3 | UK               |          3 | 2021-07-07 14:12:00 | purchase       | coffee             |           |
-- |          5 | UK               |          5 | 2022-04-12 10:55:00 | visit zebras   | NULL               |
-- |          5 | UK               |          5 | 2022-04-12 11:42:00 | purchase       | coffee             |
-- |          5 | UK               |          5 | 2022-04-12 10:45:00 | purchase       | hotdog             |
-- |          5 | UK               |          5 | 2022-04-12 11:27:00 | purchase       | popcorn            |
-- |          6 | UK               |          6 | 2022-04-12 14:00:00 | purchase       | icecream           |
-- |          9 | UK               |          9 | 2022-07-11 12:13:00 | purchase       | popcorn            |
-- |          9 | UK               |          9 | 2022-07-11 11:23:00 | purchase       | coffee             |
-- |          9 | UK               |          9 | 2022-07-11 11:00:00 | visit lions    | NULL               |
-- |          9 | UK               |          9 | 2022-07-11 11:54:00 | visit monkeys  | NULL               |
--  ------------ ------------------ ------------ --------------------- ---------------- -------------------- 

How can I get, per person, only the row corresponding to the last purchase (as reflected in activity_time column)? Please note that times within-person are jumbled. I've seen this answer which seems to be exactly on the spot, but I can't figure out how to incorporate it in the existing query.

CodePudding user response:

The typical approach, assuming you are using an up-to-date version of MySql, is to use the row_number window function approach:

with l as (
    select l.visitor_id, l.what_was_purchased, 
      Row_Number() over(partition by l.visitor_id order by l.activity_time desc) rn
  from activity_log l
  join visitors v on v.visitor_id = l.visitor_id and v.country_of_birth = 'UK'
  where l.activity_name = 'purchase'
)
select visitor_id, what_was_purchased
from l
where rn = 1;

See Demo Fiddle

  • Related