SQL occurrences of combinations into a matrix

Time:10-01

This seems like a rather common thing to do/query, but I'm not sure what is the best way to approach this and I cannot find similar examples. Basically I have 5 different systems where I've extracted the unique user IDs from the user log table. I want to know the overlap of users across the systems. The resulting tables are like this (user IDs are shared between systems):

sysA
-----
user1
user2
user3

sysB
-----
user2
user3
user4
user5

sysC
-----
user5

Now the output should be like this:

sysA sysB sysC sysD sysE count_distinct(userkey)
1    0    0    0    0    1
1    1    0    0    0    2
1    0    1    0    0    0
etc.

I tried doing this with GROUP BY CUBE, which Oracle supports and which seemed useful in this case, but I couldn't get it to work: it seems I would need to join every possible combination first in order to get the right result. Another thing I tried is this:

SELECT sysA_flag, sysB_flag, sysC_flag, sysD_flag, sysE_flag, COUNT(*)
FROM (
  SELECT DISTINCT userId, 1 sysA_flag
  FROM sysA_input_table
) sysA
FULL OUTER JOIN (
  SELECT DISTINCT userId, 1 sysB_flag
  FROM sysB_input_table
) sysB
ON sysA.userId = sysB.userId
FULL OUTER JOIN (
  SELECT DISTINCT userId, 1 sysC_flag
  FROM sysC_input_table
) sysC
ON sysA.userId = sysC.userId
FULL OUTER JOIN (
  SELECT DISTINCT userId, 1 sysD_flag
  FROM sysD_input_table
) sysD
ON sysA.userId = sysD.userId
FULL OUTER JOIN (
  SELECT DISTINCT userId, 1 sysE_flag
  FROM sysE_input_table
) sysE
ON sysA.userId = sysE.userId

GROUP BY (sysA_flag, sysB_flag, sysC_flag, sysD_flag, sysE_flag)

In principle this gives the right output, but it doesn't return all combinations (only those involving sysA), since every join is made against sysA.userId. It might work by expanding on this, but that seems like an inefficient way to do it.

How should this be done in a proper way?

CodePudding user response:

This answers the "how to do it" part of your question. A straightforward starting point would be something like this:

-- Sample data for three of the systems (replace with the real tables).
with sysA as (
  select 'user1' as user_id from dual union all
  select 'user2' from dual union all
  select 'user3' from dual
),
sysB as (
  select 'user2' as user_id from dual union all
  select 'user3' from dual union all
  select 'user4' from dual union all
  select 'user5' from dual
),
sysC as (
  select 'user5' as user_id from dual
),
-- One row per user with a 0/1 flag per system
-- (assuming user IDs are unique within each system).
cte as (
  select * from (
    select t1.*, 'a' sys from sysA t1
    union all
    select t1.*, 'b' from sysB t1
    union all
    select t1.*, 'c' from sysC t1
  ) x
  pivot (count(*) for sys in ('a' as a, 'b' as b, 'c' as c))
),
-- Every possible 0/1 combination of the three flags.
combinations as (
  select * from (select level - 1 as a from dual connect by level <= 2) t1
  cross join (select level - 1 as b from dual connect by level <= 2) t2
  cross join (select level - 1 as c from dual connect by level <= 2) t3
)
-- Count the users matching each combination; combinations without users get 0.
select t1.a, t1.b, t1.c, count(t2.user_id) as user_count
from combinations t1
left join cte t2 on t1.a = t2.a and t1.b = t2.b and t1.c = t2.c
group by t1.a, t1.b, t1.c
-- having t1.a + t1.b + t1.c = 2
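
If the combinations with zero users are not needed, a simpler variant may be enough: stack all systems into one list, compute one 0/1 flag per user with conditional aggregation, and group by the flags. The following is only a sketch under that assumption. The names all_users, user_flags and sys_name are made up for illustration, and the inline sample rows stand in for the real per-system tables (for example: select userId, 'A' from sysA_input_table union all ...).

```sql
-- Sketch only: inline sample data stands in for the real per-system tables.
with all_users as (
  select 'user1' as user_id, 'A' as sys_name from dual union all
  select 'user2', 'A' from dual union all
  select 'user3', 'A' from dual union all
  select 'user2', 'B' from dual union all
  select 'user3', 'B' from dual union all
  select 'user4', 'B' from dual union all
  select 'user5', 'B' from dual union all
  select 'user5', 'C' from dual
),
-- One row per user: 1 if the user appears in the system, else 0.
user_flags as (
  select user_id,
         max(case when sys_name = 'A' then 1 else 0 end) as sysA,
         max(case when sys_name = 'B' then 1 else 0 end) as sysB,
         max(case when sys_name = 'C' then 1 else 0 end) as sysC
  from all_users
  group by user_id
)
-- Count users per combination of flags that actually occurs.
select sysA, sysB, sysC, count(*) as user_count
from user_flags
group by sysA, sysB, sysC
order by sysA desc, sysB desc, sysC desc;
```

For the sample data in the question this should return four rows: (1,1,0) with 2 users, (1,0,0) with 1, (0,1,1) with 1 and (0,1,0) with 1. Because max() is used, duplicate user IDs within a system do not inflate the counts. To also list the combinations that have no users, the result can be left-joined to a combinations set like the one in the query above.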
