Home > Net >  Full outer join not giving the answer I need
Full outer join not giving the answer I need

Time:02-25

I am using PostgreSQL and am having difficulty with getting a series of queries that combine the data from two tables (t1, t2)

t1 is

studyida gender age
a M 1
a M 2
a M 3
b F 4
b F 5
b F 6
c M 13
c M 14
c M 15

and t2 is

studyida studyidb gender age
a z M 3
a z M 4
a z M 5
NULL y F 7
NULL y F 8
NULL y F 9
c x M 10
c x M 11
c x M 12
NULL w F 7
NULL w F 8
NULL w F 9
NULL u M 7
NULL u M 8
NULL u M 9

t1 and t2 are related via StudyIDA and gender. What I need is a comprehensive listing from both tables, including including the ages. Sometimes the age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3) but most of the time it does not.

I am looking to create a table like this

StudyIDA StudyIDB gender ageA ageB
a z M 1
a z M 2
a z M 3 3
a z M 4
a z M 5
b NULL F 4
b NULL F 5
b NULL F 6
NULL y F 7
NULL y F 8
NULL y F 9
c x F 13
c x F 14
c x F 15
c x F 10
c x F 11
c x F 12
NULL w F 7
NULL w F 8
NULL w F 9
NULL u M 7
NULL u M 8
NULL u M 9

I was thinking that first a full outer join of t1 and t2 would give me what I want but it does not.

Then I was thinking I need a listing of all the individuals (lets call it t3), and then do a series of inserts (e.g. t1 t3 and also t1 t3) into a new table to 'construct' what I need. I am really stuck on the odd times when age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3).

I am still not getting what I need. Here is my code so far

DROP TABLE IF EXISTS t1, t2, t3;

CREATE TEMPORARY TABLE t1 (StudyIDA VARCHAR, gender VARCHAR, age int);
INSERT INTO t1 VALUES 
    ('a','M', 1),('a','M', 2),('a','M', 3),
    ('b','F', 4),('b','F', 5),('b','F', 6),
    ('c','M', 13),('c','M', 14),('c','M', 15);
SELECT * FROM t1;

CREATE TEMPORARY TABLE t2 (StudyIDA VARCHAR, StudyIDB varchar, gender VARCHAR, age int);
INSERT INTO t2 VALUES 
    ('a','z','M', 3), ('a','z','M', 4), ('a','z','M', 5),
    (NULL,'y','F', 7),(NULL,'y','F', 8),(NULL,'y','F', 9),
    ('c','x','M', 10),('c','x','M', 11),('c','x','M', 12),
    (NULL,'w','F', 7),(NULL,'w','F', 8),(NULL,'w','F', 9),
    (NULL,'u','M', 7),(NULL,'u','M', 8),(NULL,'u','M', 9);
SELECT * FROM t2;

CREATE TEMPORARY TABLE t3 (StudyIDA_t1 VARCHAR, gender_t1 VARCHAR, StudyIDA_t2 VARCHAR,StudyIDB varchar,
    gender_t2 VARCHAR);

INSERT INTO t3
    SELECT * FROM (SELECT DISTINCT StudyIDA, gender FROM t1) a FULL OUTER JOIN 
    (SELECT DISTINCT StudyIDA, StudyIDB, gender FROM t2) b 
    ON a.StudyIDA=b.StudyIDA AND a.gender=b.gender
    ORDER BY a.StudyIDA;

SELECT * FROM t3 ORDER BY StudyIDA_t1;


SELECT 'IN t1', * 
    FROM t3 JOIN t1 on t1.StudyIDA=t3.StudyIDA_t1 AND t1.gender=t3.gender_t1
    ORDER BY StudyIDA_t1, StudyIDB;

SELECT 'In t2',* 
    FROM t3 JOIN t2 on t3.StudyIDA_t1=t2.StudyIDA AND t3.gender_t1=t2.gender
    ORDER BY StudyIDA_t1, t3.StudyIDB;

DROP TABLE IF EXISTS t1, t2, t3;

CodePudding user response:

A full join that includes the age maybe?
And some coalesce's for common fields.

SELECT DISTINCT 
  COALESCE(t1.StudyIDA, t2.StudyIDA) AS StudyIDA
, t2.StudyIDB 
, COALESCE(t1.gender, t2.gender) AS gender
, t1.age as ageA
, t2.age as ageB
FROM t1
FULL JOIN t2 
  ON t2.StudyIDA is not distinct from t1.StudyIDA
 AND t2.gender = t1.gender 
 AND t2.age = t1.age
ORDER BY StudyIDA, gender, ageA, ageB;
studyida | studyidb | gender | agea | ageb
:------- | :------- | :----- | ---: | ---:
a        | null     | M      |    1 | null
a        | null     | M      |    2 | null
a        | z        | M      |    3 |    3
a        | z        | M      | null |    4
a        | z        | M      | null |    5
b        | null     | F      |    4 | null
b        | null     | F      |    5 | null
b        | null     | F      |    6 | null
c        | null     | M      |   13 | null
c        | null     | M      |   14 | null
c        | null     | M      |   15 | null
c        | x        | M      | null |   10
c        | x        | M      | null |   11
c        | x        | M      | null |   12
null     | w        | F      | null |    7
null     | y        | F      | null |    7
null     | w        | F      | null |    8
null     | y        | F      | null |    8
null     | w        | F      | null |    9
null     | y        | F      | null |    9
null     | u        | M      | null |    7
null     | u        | M      | null |    8
null     | u        | M      | null |    9

db<>fiddle here

CodePudding user response:

Your sample data indicates that only t2.studyida can be NULL and all other columns should really be declared as NOT NULL.
If so, I suggest this simpler query:

SELECT studyida, b.studyidb, gender, age
     , CASE WHEN a.age IS NULL THEN 'b'
            WHEN b.age IS NULL THEN 'a'
                               ELSE 'a and b' END as source
FROM   t1 a
FULL   JOIN t2 b USING (studyida, gender, age)
ORDER  BY studyida, gender, age;

db<>fiddle here

The USING clause is convenient for identically named join columns. Only a single instance of the joining column is in the result set, effectively what COALESCE(a.col, b.col) gives you otherwise. (You might just use SELECT *.)
You can still reference source columns with table-qualification, like a.age.

I reduced to a single age column and added source. You may or may not want that.

Either way, "age" is subject to bitrot, almost always the wrong choice for a table column, and should typically be replaced by "birthday" or similar.

  • Related