Home > Software engineering >  Group items by 2 columns or similarity with another column
Group items by 2 columns or similarity with another column

Time:09-23

Let's say we have a Sqlite table containing:

name;city;age;id;foo
Alice;New-York;25;13782749;12
Eve;Chicago;23;1938679;34
Bob;New-York;25;824697624;56
Jack;Denver;30;239679163;12
Simone;Denver;30;1687631;99

How to group by h=CONCAT(city,age) but also add in the group the users that have a column foo identical to someone in the group?

As seen in Group items by 2 columns, here is how to group by h:

select dense_rank() over (order by city, age) as grpnum, name, id from t;

How to add this second condition on foo?

Example: here Alice and Bob are in the same group1 because they have identical h=CONCAT(city,age), but Jack should also be in group1 because he has same foo value as Alice: 12. Simone has same CONCAT(city,age) than Jack, therefore she is in the same group as Jack, in group1 as well.

CodePudding user response:

This is much trickier. This is a graph-walking problem, which requires a recursive CTE (Common Table Expression).

The idea is to generate all the edges between names. Then use a recursive CTE to visit all connected nodes (names), avoiding visiting existing ones.

Here is one method:

with recursive edges as (
      select distinct t1.name as name1, t2.name as name2
      from t t1 join
           t t2
           on t1.city = t2.city and t1.age = t2.age or
              t1.foo = t2.foo
     ),
     cte as (
      select name1, name2, min(name1, name2) as first_name,
             ',' || name1 || ',' || name2 || ',' as visited
      from edges
      union all
      select cte.name1, e.name2, min(cte.first_name, e.name2),
             visited || e.name2 || ','
      from cte join
           edges e
           on e.name1 = cte.name2
      where visited not like '%,' || e.name2 || ',%'
     )
select t.*, cte.grpnum
from t join
     (select name1, min(first_name), dense_rank() over (order by min(first_name)) as grpnum
      from cte
      group by name1
     ) cte
     on t.name = cte.name1;

And here is a db<>fiddle.

  • Related