Full outer join not giving the answer I need-CodePudding

I am using PostgreSQL and am having difficulty with getting a series of queries that combine the data from two tables (t1, t2)

t1 is

studyida	gender	age
a	M	1
a	M	2
a	M	3
b	F	4
b	F	5
b	F	6
c	M	13
c	M	14
c	M	15

and t2 is

studyida	studyidb	gender	age
a	z	M	3
a	z	M	4
a	z	M	5
NULL	y	F	7
NULL	y	F	8
NULL	y	F	9
c	x	M	10
c	x	M	11
c	x	M	12
NULL	w	F	7
NULL	w	F	8
NULL	w	F	9
NULL	u	M	7
NULL	u	M	8
NULL	u	M	9

t1 and t2 are related via StudyIDA and gender. What I need is a comprehensive listing from both tables, including including the ages. Sometimes the age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3) but most of the time it does not.

I am looking to create a table like this

StudyIDA	StudyIDB	gender	ageA	ageB
a	z	M	1
a	z	M	2
a	z	M	3	3
a	z	M		4
a	z	M		5
b	NULL	F	4
b	NULL	F	5
b	NULL	F	6
NULL	y	F		7
NULL	y	F		8
NULL	y	F		9
c	x	F	13
c	x	F	14
c	x	F	15
c	x	F		10
c	x	F		11
c	x	F		12
NULL	w	F		7
NULL	w	F		8
NULL	w	F		9
NULL	u	M		7
NULL	u	M		8
NULL	u	M		9

I was thinking that first a full outer join of t1 and t2 would give me what I want but it does not.

Then I was thinking I need a listing of all the individuals (lets call it t3), and then do a series of inserts (e.g. t1 t3 and also t1 t3) into a new table to 'construct' what I need. I am really stuck on the odd times when age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3).

I am still not getting what I need. Here is my code so far

DROP TABLE IF EXISTS t1, t2, t3;

CREATE TEMPORARY TABLE t1 (StudyIDA VARCHAR, gender VARCHAR, age int);
INSERT INTO t1 VALUES 
    ('a','M', 1),('a','M', 2),('a','M', 3),
    ('b','F', 4),('b','F', 5),('b','F', 6),
    ('c','M', 13),('c','M', 14),('c','M', 15);
SELECT * FROM t1;

CREATE TEMPORARY TABLE t2 (StudyIDA VARCHAR, StudyIDB varchar, gender VARCHAR, age int);
INSERT INTO t2 VALUES 
    ('a','z','M', 3), ('a','z','M', 4), ('a','z','M', 5),
    (NULL,'y','F', 7),(NULL,'y','F', 8),(NULL,'y','F', 9),
    ('c','x','M', 10),('c','x','M', 11),('c','x','M', 12),
    (NULL,'w','F', 7),(NULL,'w','F', 8),(NULL,'w','F', 9),
    (NULL,'u','M', 7),(NULL,'u','M', 8),(NULL,'u','M', 9);
SELECT * FROM t2;

CREATE TEMPORARY TABLE t3 (StudyIDA_t1 VARCHAR, gender_t1 VARCHAR, StudyIDA_t2 VARCHAR,StudyIDB varchar,
    gender_t2 VARCHAR);

INSERT INTO t3
    SELECT * FROM (SELECT DISTINCT StudyIDA, gender FROM t1) a FULL OUTER JOIN 
    (SELECT DISTINCT StudyIDA, StudyIDB, gender FROM t2) b 
    ON a.StudyIDA=b.StudyIDA AND a.gender=b.gender
    ORDER BY a.StudyIDA;

SELECT * FROM t3 ORDER BY StudyIDA_t1;


SELECT 'IN t1', * 
    FROM t3 JOIN t1 on t1.StudyIDA=t3.StudyIDA_t1 AND t1.gender=t3.gender_t1
    ORDER BY StudyIDA_t1, StudyIDB;

SELECT 'In t2',* 
    FROM t3 JOIN t2 on t3.StudyIDA_t1=t2.StudyIDA AND t3.gender_t1=t2.gender
    ORDER BY StudyIDA_t1, t3.StudyIDB;

DROP TABLE IF EXISTS t1, t2, t3;

CodePudding user response：

A full join that includes the age maybe?
And some coalesce's for common fields.

SELECT DISTINCT 
  COALESCE(t1.StudyIDA, t2.StudyIDA) AS StudyIDA
, t2.StudyIDB 
, COALESCE(t1.gender, t2.gender) AS gender
, t1.age as ageA
, t2.age as ageB
FROM t1
FULL JOIN t2 
  ON t2.StudyIDA is not distinct from t1.StudyIDA
 AND t2.gender = t1.gender 
 AND t2.age = t1.age
ORDER BY StudyIDA, gender, ageA, ageB;

studyida | studyidb | gender | agea | ageb
:------- | :------- | :----- | ---: | ---:
a        | null     | M      |    1 | null
a        | null     | M      |    2 | null
a        | z        | M      |    3 |    3
a        | z        | M      | null |    4
a        | z        | M      | null |    5
b        | null     | F      |    4 | null
b        | null     | F      |    5 | null
b        | null     | F      |    6 | null
c        | null     | M      |   13 | null
c        | null     | M      |   14 | null
c        | null     | M      |   15 | null
c        | x        | M      | null |   10
c        | x        | M      | null |   11
c        | x        | M      | null |   12
null     | w        | F      | null |    7
null     | y        | F      | null |    7
null     | w        | F      | null |    8
null     | y        | F      | null |    8
null     | w        | F      | null |    9
null     | y        | F      | null |    9
null     | u        | M      | null |    7
null     | u        | M      | null |    8
null     | u        | M      | null |    9

db<>fiddle here

CodePudding user response：

Your sample data indicates that only t2.studyida can be NULL and all other columns should really be declared as NOT NULL.
If so, I suggest this simpler query:

SELECT studyida, b.studyidb, gender, age
     , CASE WHEN a.age IS NULL THEN 'b'
            WHEN b.age IS NULL THEN 'a'
                               ELSE 'a and b' END as source
FROM   t1 a
FULL   JOIN t2 b USING (studyida, gender, age)
ORDER  BY studyida, gender, age;

db<>fiddle here

The USING clause is convenient for identically named join columns. Only a single instance of the joining column is in the result set, effectively what COALESCE(a.col, b.col) gives you otherwise. (You might just use SELECT *.)
You can still reference source columns with table-qualification, like a.age.

I reduced to a single age column and added source. You may or may not want that.

Either way, "age" is subject to bitrot, almost always the wrong choice for a table column, and should typically be replaced by "birthday" or similar.