For the sake of simplicity, suppose you have a table with numbers like:
| number |
----------
|123 |
|1234 |
|12345 |
|123456 |
|111 |
|1111 |
|2 |
|700 |
What would be an efficient way of retrieving the shortest numbers (call them roots or whatever) and all values derived from them, eg:
| root | derivatives |
--------------------------------
| 123 | 1234, 12345, 123456 |
| 111 | 1111 |
Numbers 2
& 700
are excluded from the list because they're unique, and thus have no derivatives.
An output as the above would be ideal, but since it's probably difficult to achieve, the next best thing would be something like below, which I can then post-process:
| root | derivative |
-----------------------
| 123 | 1234 |
| 123 | 12345 |
| 123 | 123456 |
| 111 | 1111 |
My naive initial attempt to at least identify roots (see below) has been running for 4h now with a dataset of ~500k items, but the real one I'd have to inspect consists of millions.
select number
from numbers n1
where exists(
select number
from numbers n2
where n2.number <> n1.number
and n2.number like n1.number || '_%'
);
CodePudding user response:
This works if number
is an integer
or bigint
:
select min(a.number) as root, b.number as derivative
from nums a
cross join lateral generate_series(1, 18) as gs(power)
join nums b
on b.number / (10^gs.power)::bigint = a.number
group by b.number
order by root, derivative;
If number is a character type, then try this:
with groupings as (
select number,
case
when number like (lag(number) over (order by number))||'%' then 0
else 1
end as newgroup
from numchar
), groupnums as (
select number, sum(newgroup) over (order by number) as groupnum
from groupings
), matches as (
select min(number) over (partition by groupnum) as root,
number as derivative
from groupnums
)
select *
from matches
where root != derivative;
There should be only a single sort on groupnum
in this execution since the column is your table's primary key.
db<>fiddle here