Home > Net >  Missing rows in full outer join
Missing rows in full outer join

Time:09-24

I am trying to count how many users are observed on each of the 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (unique ID) and d0 (or d1 or d2, which is 1 and indicates that the user is observed on that day).

The following query:

select d0,d1,d2, count(*) as user_count from (
select uid, 1 as d0
from my_table
where day=5 and uid is not Null
group by uid
) as t0 full outer join (
select uid, 1 as d1
from my_table
where day=6 and uid is not Null
group by uid
) as t1 on t0.uid = t1.uid
full outer join (
select uid, 1 as d2
from my_table
where day=7 and uid is not Null
group by uid
) as t2 on t0.uid = t2.uid and t1.uid = t2.uid
group by d0,d1,d2 order by d0,d1,d2

produces this output from spark.sql(q).toPandas().set_index(["d0","d1","d2"]):

          user_count
d0 d1 d2            
0  0  1        73455
   1  0        53345
1  0  0        49254
   1  0         8234
      1        78455

Two rows are obviously missing: 0 1 1 and 1 0 1. Why?!

PS1. I understand why 0 0 0 is missing.

PS2. my_table looks approximately like this:

create table my_table (uid integer, day integer);
insert into my_table values
 (1, 5), (1, 6), (1, 7),
 (2, 5), (2, 6),
 (3, 5), (3, 7),
 (4, 6), (4, 7),
 (5, 5),
 (6, 6),
 (7, 7);

For this table I expect the query to return

          user_count
d0 d1 d2            
0  0  1        1      --- uid = 7
   1  0        1      --- uid = 6
      1        1      --- uid = 4
1  0  0        1      --- uid = 5
      1        1      --- uid = 3
   1  0        1      --- uid = 2
      1        1      --- uid = 1

CodePudding user response:

Use two levels of aggregation instead of full join:

select d0, d1, d2, count(*)
from (select uid,
            case when sum(case when day = 5 then 1 else 0 end) > 0
             then 1 else 0 end as d0,
            case when sum(case when day = 6 then 1 else 0 end) > 0
             then 1 else 0 end as d1,
            case when sum(case when day = 7 then 1 else 0 end) > 0
             then 1 else 0 end as d2
      from my_table
      where uid is not Null
      group by uid
     ) u
group by d0, d1, d2;

CodePudding user response:

Regarding the original query the last FULL JOIN should take into account t0.uid could be null due to the first FULL JOIN so it must be OR not AND.

select d0,d1,d2, count(*) as user_count 
from (
   select uid, 1 as d0
   from my_table
   where day=5 and uid is not Null
   group by uid
) as t0 
full outer join (
   select uid, 1 as d1
   from my_table
   where day=6 and uid is not Null
   group by uid
) as t1 on t0.uid = t1.uid
full outer join (
   select uid, 1 as d2
   from my_table
   where day=7 and uid is not Null
   group by uid
) as t2 on t0.uid = t2.uid or t1.uid = t2.uid
group by d0,d1,d2 
order by d0,d1,d2;

SQL Server db<>fiddle

Personally I would stick with the Gordon Linoff's solution.

  • Related