Inner-Join on temporary columns using MAX()-CodePudding

I have the table t in MariaDB (latest), which includes among others the columns person_ID, date_1, date_2. They contain person IDs and string dates respectively. For each ID there is only one date_1, but multiple date_2. Rows either have a date_1 or date_2 that is why I am joining on ID. Here is an example of the table t:

person_ID	date_1	date_2
A	-	3
A	-	5
A	1	-
B	-	10
B	-	14
B	5	-
C	-	11
C	-	9
C	7	-

Create and fill table t:

CREATE TABLE t(
id SERIAL,
person_ID TEXT,
date_1 TEXT,
date_2 TEXT,
PRIMARY KEY (id)
);

INSERT INTO t (person_ID, date_2) VALUES ('A', 3);
INSERT INTO t (person_ID, date_2) VALUES ('A', 5);
INSERT INTO t (person_ID, date_1) VALUES ('A', 1);
INSERT INTO t (person_ID, date_2) VALUES ('B', 10);
INSERT INTO t (person_ID, date_2) VALUES ('B', 14);
INSERT INTO t (person_ID, date_1) VALUES ('B', 5);
INSERT INTO t (person_ID, date_2) VALUES ('C', 11);
INSERT INTO t (person_ID, date_2) VALUES ('C', 9);
INSERT INTO t (person_ID, date_1) VALUES ('C', 7);

SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));

The following is an inner-join of two subqueries A and B. Query A gives a distinct list of person_IDs, which contain a date_1, and date_1 itself. On the other hand query B should give a distinct list of person_IDs that contain a date_2, and MAX(date_2).

SELECT A.person_ID, A.date_A,  B.date_B, B.date_B - A.date_A AS diff FROM
    (SELECT person_ID, date_1 AS date_A FROM t
    WHERE date_1 >= 0) A
INNER JOIN
    (SELECT person_ID, MAX(date_2) AS date_B FROM t
    WHERE date_2 >= 0
    GROUP BY person_ID) B
ON A.person_ID = B.person_ID
AND B.date_B > A.date_A
AND (B.date_B - A.date_A) <= 7
GROUP BY A.person_ID;

That gives the output:

person ID	date_A	date_B	diff
A	1	5	4
C	7	9	2

But this would be the desired outcome (ignoring ID = B, because diff = 9):

person ID	date_A	date_B	diff
A	1	5	4
C	7	11	4

I assume MAX(date_2) gives 9 instead of 11 for person_ID = C, because that value was inserted last for date_2.

You can use this link to try it out yourself.

CodePudding user response：

This problem is made harder by your sparse table (rows with NULLs). Here's how I would approach this.

Start with a subquery to clean up the sparse table. It generates a result set where the rows with nulls are removed, generating a result like this.

person_ID     date_1    date_2   diff
        A        1        3        2
        A        1        5        4
        B        5       10        5
        B        5       14        9
        C        7       11        4
        C        7        9        2

This puts the single date_1 value for each person into the rows with the date_2 values. The query to do that is:

 SELECT t.person_ID, b.date_1, t.date_2, t.date_2 - b.date_1 diff
   FROM t 
   JOIN t b      ON t.person_ID = b.person_ID
                AND b.date_1 IS NOT NULL
                AND t.date_2 IS NOT NULL

Let's name the output of that subquery with the alias detail.

Your business logic calls for the very common greatest-n-per-group query pattern. It calls for retrieving the row with the largest diff for each person_ID, as long as diff <= 7. With that detail subquery we can write your logic a little more easily. In your result set you want the row for each person_ID that shows date_1, date_2, and diff for the largest diff, but leaving out any rows with a diff > 7.

First, write another subquery that finds the largest qualifying diff value for each person_ID.

SELECT person_ID, MAX(diff) diff
   FROM detail
  GROUP BY person_ID
 HAVING MAX(diff) <= 7

Then join that subquery to the detail to get your desired result set.

 SELECT detail.*
   FROM detail
   JOIN ( SELECT person_ID, MAX(diff) diff
            FROM detail
           GROUP BY person_ID
           HAVING MAX(diff) <= 7
        ) md ON detail.person_ID = md.person_ID 
             AND detail.diff = md.diff

Now, I used a common table expression to write this query: to define the detail. That syntax is available in MariaDB 10.2 (and MySQL 8 ). Putting it together, here is the query.

 WITH detail AS
 (SELECT t.person_ID, b.date_1, t.date_2, t.date_2 - b.date_1 diff
        FROM t 
        JOIN t b ON t.person_ID = b.person_ID
                AND b.date_1 IS NOT NULL
                AND t.date_2 IS NOT NULL
 )
 SELECT detail.*
   FROM detail
   JOIN ( SELECT person_ID, MAX(diff) diff
            FROM detail
           GROUP BY person_ID
           HAVING MAX(diff) <= 7
        ) md ON detail.person_ID = md.person_ID 
             AND detail.diff = md.diff

Summary: the steps to solving your problem.

Deal with the sparse-data problem in your input table ... get rid of the input rows with NULL values by filling in date_1 values in the rows that have date_2 values. And, throw in the diffs.
Find the largest eligible diff values for each person_ID.
Join that list of largest diffs back into the detail table to extract the correct row of the detail table.

Pro tip Don't turn off ONLY_FULL_GROUP_BY. You don't want to rely on MySQL / MariaDB's strange nonstandard extension to GROUP BY, because it sometimes yields the wrong values. When it does that it's baffling.