I am using PostgreSQL and am having difficulty with getting a series of queries that combine the data from two tables (t1, t2)
t1 is
studyida | gender | age |
---|---|---|
a | M | 1 |
a | M | 2 |
a | M | 3 |
b | F | 4 |
b | F | 5 |
b | F | 6 |
c | M | 13 |
c | M | 14 |
c | M | 15 |
and t2 is
studyida | studyidb | gender | age |
---|---|---|---|
a | z | M | 3 |
a | z | M | 4 |
a | z | M | 5 |
NULL | y | F | 7 |
NULL | y | F | 8 |
NULL | y | F | 9 |
c | x | M | 10 |
c | x | M | 11 |
c | x | M | 12 |
NULL | w | F | 7 |
NULL | w | F | 8 |
NULL | w | F | 9 |
NULL | u | M | 7 |
NULL | u | M | 8 |
NULL | u | M | 9 |
t1 and t2 are related via StudyIDA and gender. What I need is a comprehensive listing from both tables, including including the ages. Sometimes the age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3) but most of the time it does not.
I am looking to create a table like this
StudyIDA | StudyIDB | gender | ageA | ageB |
---|---|---|---|---|
a | z | M | 1 | |
a | z | M | 2 | |
a | z | M | 3 | 3 |
a | z | M | 4 | |
a | z | M | 5 | |
b | NULL | F | 4 | |
b | NULL | F | 5 | |
b | NULL | F | 6 | |
NULL | y | F | 7 | |
NULL | y | F | 8 | |
NULL | y | F | 9 | |
c | x | F | 13 | |
c | x | F | 14 | |
c | x | F | 15 | |
c | x | F | 10 | |
c | x | F | 11 | |
c | x | F | 12 | |
NULL | w | F | 7 | |
NULL | w | F | 8 | |
NULL | w | F | 9 | |
NULL | u | M | 7 | |
NULL | u | M | 8 | |
NULL | u | M | 9 |
I was thinking that first a full outer join of t1 and t2 would give me what I want but it does not.
Then I was thinking I need a listing of all the individuals (lets call it t3), and then do a series of inserts (e.g. t1 t3 and also t1 t3) into a new table to 'construct' what I need. I am really stuck on the odd times when age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3).
I am still not getting what I need. Here is my code so far
DROP TABLE IF EXISTS t1, t2, t3;
CREATE TEMPORARY TABLE t1 (StudyIDA VARCHAR, gender VARCHAR, age int);
INSERT INTO t1 VALUES
('a','M', 1),('a','M', 2),('a','M', 3),
('b','F', 4),('b','F', 5),('b','F', 6),
('c','M', 13),('c','M', 14),('c','M', 15);
SELECT * FROM t1;
CREATE TEMPORARY TABLE t2 (StudyIDA VARCHAR, StudyIDB varchar, gender VARCHAR, age int);
INSERT INTO t2 VALUES
('a','z','M', 3), ('a','z','M', 4), ('a','z','M', 5),
(NULL,'y','F', 7),(NULL,'y','F', 8),(NULL,'y','F', 9),
('c','x','M', 10),('c','x','M', 11),('c','x','M', 12),
(NULL,'w','F', 7),(NULL,'w','F', 8),(NULL,'w','F', 9),
(NULL,'u','M', 7),(NULL,'u','M', 8),(NULL,'u','M', 9);
SELECT * FROM t2;
CREATE TEMPORARY TABLE t3 (StudyIDA_t1 VARCHAR, gender_t1 VARCHAR, StudyIDA_t2 VARCHAR,StudyIDB varchar,
gender_t2 VARCHAR);
INSERT INTO t3
SELECT * FROM (SELECT DISTINCT StudyIDA, gender FROM t1) a FULL OUTER JOIN
(SELECT DISTINCT StudyIDA, StudyIDB, gender FROM t2) b
ON a.StudyIDA=b.StudyIDA AND a.gender=b.gender
ORDER BY a.StudyIDA;
SELECT * FROM t3 ORDER BY StudyIDA_t1;
SELECT 'IN t1', *
FROM t3 JOIN t1 on t1.StudyIDA=t3.StudyIDA_t1 AND t1.gender=t3.gender_t1
ORDER BY StudyIDA_t1, StudyIDB;
SELECT 'In t2',*
FROM t3 JOIN t2 on t3.StudyIDA_t1=t2.StudyIDA AND t3.gender_t1=t2.gender
ORDER BY StudyIDA_t1, t3.StudyIDB;
DROP TABLE IF EXISTS t1, t2, t3;
CodePudding user response:
A full join that includes the age maybe?
And some coalesce's for common fields.
SELECT DISTINCT COALESCE(t1.StudyIDA, t2.StudyIDA) AS StudyIDA , t2.StudyIDB , COALESCE(t1.gender, t2.gender) AS gender , t1.age as ageA , t2.age as ageB FROM t1 FULL JOIN t2 ON t2.StudyIDA is not distinct from t1.StudyIDA AND t2.gender = t1.gender AND t2.age = t1.age ORDER BY StudyIDA, gender, ageA, ageB;
studyida | studyidb | gender | agea | ageb :------- | :------- | :----- | ---: | ---: a | null | M | 1 | null a | null | M | 2 | null a | z | M | 3 | 3 a | z | M | null | 4 a | z | M | null | 5 b | null | F | 4 | null b | null | F | 5 | null b | null | F | 6 | null c | null | M | 13 | null c | null | M | 14 | null c | null | M | 15 | null c | x | M | null | 10 c | x | M | null | 11 c | x | M | null | 12 null | w | F | null | 7 null | y | F | null | 7 null | w | F | null | 8 null | y | F | null | 8 null | w | F | null | 9 null | y | F | null | 9 null | u | M | null | 7 null | u | M | null | 8 null | u | M | null | 9
db<>fiddle here
CodePudding user response:
Your sample data indicates that only t2.studyida
can be NULL
and all other columns should really be declared as NOT NULL
.
If so, I suggest this simpler query:
SELECT studyida, b.studyidb, gender, age
, CASE WHEN a.age IS NULL THEN 'b'
WHEN b.age IS NULL THEN 'a'
ELSE 'a and b' END as source
FROM t1 a
FULL JOIN t2 b USING (studyida, gender, age)
ORDER BY studyida, gender, age;
db<>fiddle here
The USING
clause is convenient for identically named join columns. Only a single instance of the joining column is in the result set, effectively what COALESCE(a.col, b.col)
gives you otherwise. (You might just use SELECT *
.)
You can still reference source columns with table-qualification, like a.age
.
I reduced to a single age
column and added source
. You may or may not want that.
Either way, "age" is subject to bitrot, almost always the wrong choice for a table column, and should typically be replaced by "birthday" or similar.