Home > Blockchain >  Inner-Join on temporary columns using MAX()
Inner-Join on temporary columns using MAX()

Time:10-04

I have the table t in MariaDB (latest), which includes among others the columns person_ID, date_1, date_2. They contain person IDs and string dates respectively. For each ID there is only one date_1, but multiple date_2. Rows either have a date_1 or date_2 that is why I am joining on ID. Here is an example of the table t:

person_ID date_1 date_2
A - 3
A - 5
A 1 -
B - 10
B - 14
B 5 -
C - 11
C - 9
C 7 -

Create and fill table t:

CREATE TABLE t(
id SERIAL,
person_ID TEXT,
date_1 TEXT,
date_2 TEXT,
PRIMARY KEY (id)
);

INSERT INTO t (person_ID, date_2) VALUES ('A', 3);
INSERT INTO t (person_ID, date_2) VALUES ('A', 5);
INSERT INTO t (person_ID, date_1) VALUES ('A', 1);
INSERT INTO t (person_ID, date_2) VALUES ('B', 10);
INSERT INTO t (person_ID, date_2) VALUES ('B', 14);
INSERT INTO t (person_ID, date_1) VALUES ('B', 5);
INSERT INTO t (person_ID, date_2) VALUES ('C', 11);
INSERT INTO t (person_ID, date_2) VALUES ('C', 9);
INSERT INTO t (person_ID, date_1) VALUES ('C', 7);

SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));

The following is an inner-join of two subqueries A and B. Query A gives a distinct list of person_IDs, which contain a date_1, and date_1 itself. On the other hand query B should give a distinct list of person_IDs that contain a date_2, and MAX(date_2).

SELECT A.person_ID, A.date_A,  B.date_B, B.date_B - A.date_A AS diff FROM
    (SELECT person_ID, date_1 AS date_A FROM t
    WHERE date_1 >= 0) A
INNER JOIN
    (SELECT person_ID, MAX(date_2) AS date_B FROM t
    WHERE date_2 >= 0
    GROUP BY person_ID) B
ON A.person_ID = B.person_ID
AND B.date_B > A.date_A
AND (B.date_B - A.date_A) <= 7
GROUP BY A.person_ID;

That gives the output:

person ID date_A date_B diff
A 1 5 4
C 7 9 2

But this would be the desired outcome (ignoring ID = B, because diff = 9):

person ID date_A date_B diff
A 1 5 4
C 7 11 4

I assume MAX(date_2) gives 9 instead of 11 for person_ID = C, because that value was inserted last for date_2.

You can use this link to try it out yourself.

CodePudding user response:

This problem is made harder by your sparse table (rows with NULLs). Here's how I would approach this.

Start with a subquery to clean up the sparse table. It generates a result set where the rows with nulls are removed, generating a result like this.

person_ID     date_1    date_2   diff
        A        1        3        2
        A        1        5        4
        B        5       10        5
        B        5       14        9
        C        7       11        4
        C        7        9        2

This puts the single date_1 value for each person into the rows with the date_2 values. The query to do that is:

 SELECT t.person_ID, b.date_1, t.date_2, t.date_2 - b.date_1 diff
   FROM t 
   JOIN t b      ON t.person_ID = b.person_ID
                AND b.date_1 IS NOT NULL
                AND t.date_2 IS NOT NULL

Let's name the output of that subquery with the alias detail.

Your business logic calls for the very common query pattern. It calls for retrieving the row with the largest diff for each person_ID, as long as diff <= 7. With that detail subquery we can write your logic a little more easily. In your result set you want the row for each person_ID that shows date_1, date_2, and diff for the largest diff, but leaving out any rows with a diff > 7.

First, write another subquery that finds the largest qualifying diff value for each person_ID.

SELECT person_ID, MAX(diff) diff
   FROM detail
  GROUP BY person_ID
 HAVING MAX(diff) <= 7

Then join that subquery to the detail to get your desired result set.

 SELECT detail.*
   FROM detail
   JOIN ( SELECT person_ID, MAX(diff) diff
            FROM detail
           GROUP BY person_ID
           HAVING MAX(diff) <= 7
        ) md ON detail.person_ID = md.person_ID 
             AND detail.diff = md.diff
            

Now, I used a common table expression to write this query: to define the detail. That syntax is available in MariaDB 10.2 (and MySQL 8 ). Putting it together, here is the query.

 WITH detail AS
 (SELECT t.person_ID, b.date_1, t.date_2, t.date_2 - b.date_1 diff
        FROM t 
        JOIN t b ON t.person_ID = b.person_ID
                AND b.date_1 IS NOT NULL
                AND t.date_2 IS NOT NULL
 )
 SELECT detail.*
   FROM detail
   JOIN ( SELECT person_ID, MAX(diff) diff
            FROM detail
           GROUP BY person_ID
           HAVING MAX(diff) <= 7
        ) md ON detail.person_ID = md.person_ID 
             AND detail.diff = md.diff

Summary: the steps to solving your problem.

  • Deal with the sparse-data problem in your input table ... get rid of the input rows with NULL values by filling in date_1 values in the rows that have date_2 values. And, throw in the diffs.
  • Find the largest eligible diff values for each person_ID.
  • Join that list of largest diffs back into the detail table to extract the correct row of the detail table.

Pro tip Don't turn off ONLY_FULL_GROUP_BY. You don't want to rely on MySQL / MariaDB's strange nonstandard extension to GROUP BY, because it sometimes yields the wrong values. When it does that it's baffling.

  • Related