Home > Net >  Efficient way to retrieve all values from a column that start with other values from the same column
Efficient way to retrieve all values from a column that start with other values from the same column

Time:06-27

For the sake of simplicity, suppose you have a table with numbers like:

| number |
----------
|123     |
|1234    |
|12345   |
|123456  |
|111     |
|1111    |
|2       |
|700     |

What would be an efficient way of retrieving the shortest numbers (call them roots or whatever) and all values derived from them, eg:

| root   | derivatives         |
--------------------------------
| 123    | 1234, 12345, 123456 |
| 111    | 1111                |

Numbers 2 & 700 are excluded from the list because they're unique, and thus have no derivatives.

An output as the above would be ideal, but since it's probably difficult to achieve, the next best thing would be something like below, which I can then post-process:

| root   | derivative |
-----------------------
| 123    | 1234       |
| 123    | 12345      |
| 123    | 123456     |
| 111    | 1111       |

My naive initial attempt to at least identify roots (see below) has been running for 4h now with a dataset of ~500k items, but the real one I'd have to inspect consists of millions.

select number
from numbers n1
where exists(
              select number
              from numbers n2
              where n2.number <> n1.number
                and n2.number like n1.number || '_%'
          );

CodePudding user response:

This works if number is an integer or bigint:

select min(a.number) as root, b.number as derivative
  from nums a
       cross join lateral generate_series(1, 18) as gs(power)
       join nums b 
         on b.number / (10^gs.power)::bigint = a.number
 group by b.number
 order by root, derivative;

If number is a character type, then try this:

with groupings as (
  select number, 
         case
           when number like (lag(number) over (order by number))||'%' then 0
           else 1
         end as newgroup
    from numchar
), groupnums as (
  select number, sum(newgroup) over (order by number) as groupnum
    from groupings
), matches as (
  select min(number) over (partition by groupnum) as root,
         number as derivative
    from groupnums
)
select *
  from matches
 where root != derivative;

There should be only a single sort on groupnum in this execution since the column is your table's primary key.

db<>fiddle here

  • Related