I have the table t in MariaDB (latest), which includes among others the columns person_ID, date_1, date_2. They contain person IDs and string dates respectively. For each ID there is only one date_1, but multiple date_2. Rows either have a date_1 or date_2 that is why I am joining on ID. Here is an example of the table t:
person_ID | date_1 | date_2 |
---|---|---|
A | - | 3 |
A | - | 5 |
A | 1 | - |
B | - | 10 |
B | - | 14 |
B | 5 | - |
C | - | 11 |
C | - | 9 |
C | 7 | - |
Create and fill table t:
CREATE TABLE t(
id SERIAL,
person_ID TEXT,
date_1 TEXT,
date_2 TEXT,
PRIMARY KEY (id)
);
INSERT INTO t (person_ID, date_2) VALUES ('A', 3);
INSERT INTO t (person_ID, date_2) VALUES ('A', 5);
INSERT INTO t (person_ID, date_1) VALUES ('A', 1);
INSERT INTO t (person_ID, date_2) VALUES ('B', 10);
INSERT INTO t (person_ID, date_2) VALUES ('B', 14);
INSERT INTO t (person_ID, date_1) VALUES ('B', 5);
INSERT INTO t (person_ID, date_2) VALUES ('C', 11);
INSERT INTO t (person_ID, date_2) VALUES ('C', 9);
INSERT INTO t (person_ID, date_1) VALUES ('C', 7);
SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));
The following is an inner-join of two subqueries A and B. Query A gives a distinct list of person_IDs, which contain a date_1, and date_1 itself. On the other hand query B should give a distinct list of person_IDs that contain a date_2, and MAX(date_2).
SELECT A.person_ID, A.date_A, B.date_B, B.date_B - A.date_A AS diff FROM
(SELECT person_ID, date_1 AS date_A FROM t
WHERE date_1 >= 0) A
INNER JOIN
(SELECT person_ID, MAX(date_2) AS date_B FROM t
WHERE date_2 >= 0
GROUP BY person_ID) B
ON A.person_ID = B.person_ID
AND B.date_B > A.date_A
AND (B.date_B - A.date_A) <= 7
GROUP BY A.person_ID;
That gives the output:
person ID | date_A | date_B | diff |
---|---|---|---|
A | 1 | 5 | 4 |
C | 7 | 9 | 2 |
But this would be the desired outcome (ignoring ID = B, because diff = 9):
person ID | date_A | date_B | diff |
---|---|---|---|
A | 1 | 5 | 4 |
C | 7 | 11 | 4 |
I assume MAX(date_2) gives 9 instead of 11 for person_ID = C, because that value was inserted last for date_2.
You can use this link to try it out yourself.
CodePudding user response:
This problem is made harder by your sparse table (rows with NULLs). Here's how I would approach this.
Start with a subquery to clean up the sparse table. It generates a result set where the rows with nulls are removed, generating a result like this.
person_ID date_1 date_2 diff
A 1 3 2
A 1 5 4
B 5 10 5
B 5 14 9
C 7 11 4
C 7 9 2
This puts the single date_1
value for each person into the rows with the date_2
values. The query to do that is:
SELECT t.person_ID, b.date_1, t.date_2, t.date_2 - b.date_1 diff
FROM t
JOIN t b ON t.person_ID = b.person_ID
AND b.date_1 IS NOT NULL
AND t.date_2 IS NOT NULL
Let's name the output of that subquery with the alias detail
.
Your business logic calls for the very common greatest-n-per-group query pattern. It calls for retrieving the row with the largest diff
for each person_ID
, as long as diff <= 7
. With that detail
subquery we can write your logic a little more easily. In your result set you want the row for each person_ID
that shows date_1
, date_2
, and diff
for the largest diff, but leaving out any rows with a diff
> 7.
First, write another subquery that finds the largest qualifying diff
value for each person_ID
.
SELECT person_ID, MAX(diff) diff
FROM detail
GROUP BY person_ID
HAVING MAX(diff) <= 7
Then join that subquery to the detail to get your desired result set.
SELECT detail.*
FROM detail
JOIN ( SELECT person_ID, MAX(diff) diff
FROM detail
GROUP BY person_ID
HAVING MAX(diff) <= 7
) md ON detail.person_ID = md.person_ID
AND detail.diff = md.diff
Now, I used a common table expression to write this query: to define the detail
. That syntax is available in MariaDB 10.2 (and MySQL 8 ). Putting it together, here is the query.
WITH detail AS
(SELECT t.person_ID, b.date_1, t.date_2, t.date_2 - b.date_1 diff
FROM t
JOIN t b ON t.person_ID = b.person_ID
AND b.date_1 IS NOT NULL
AND t.date_2 IS NOT NULL
)
SELECT detail.*
FROM detail
JOIN ( SELECT person_ID, MAX(diff) diff
FROM detail
GROUP BY person_ID
HAVING MAX(diff) <= 7
) md ON detail.person_ID = md.person_ID
AND detail.diff = md.diff
Summary: the steps to solving your problem.
- Deal with the sparse-data problem in your input table ... get rid of the input rows with NULL values by filling in
date_1
values in the rows that havedate_2
values. And, throw in thediff
s. - Find the largest eligible
diff
values for eachperson_ID
. - Join that list of largest
diff
s back into the detail table to extract the correct row of the detail table.
Pro tip Don't turn off ONLY_FULL_GROUP_BY
. You don't want to rely on MySQL / MariaDB's strange nonstandard extension to GROUP BY, because it sometimes yields the wrong values. When it does that it's baffling.