Running SQL Server 14.0.2037.2 on Windows 10 Enterprise version 21H1
I am trying to return IDs only when another columns entry across all rows within that ID matches the entries of a similar column in a second table, again within specific IDs. For example,
create table #F(patentid VARCHAR(8), subclass VARCHAR(3));
insert into #F values ('l','x');
insert into #F values ('l','y');
insert into #F values ('l','z');
insert into #F values ('m','x');
insert into #F values ('n','z');
create table #P(patentid VARCHAR(8), subclass VARCHAR(3));
insert into #P values ('b','x');
insert into #P values ('c','w');
insert into #P values ('c','x');
insert into #P values ('c','y');
insert into #P values ('c','z');
insert into #P values ('d','x');
insert into #P values ('d','y');
insert into #P values ('d','z');
insert into #P values ('e','x');
insert into #P values ('e','y');
insert into #P values ('e','z');
I am trying to write a query that will return, for each patentid in #F, all patentIDs in #P where the subclass entries match exactly. This true for "l" in #F that matches with "d" and "e" in #P [the matching subclasses are 'x', 'y' and 'z'] as well as for "m" in #F that matches with "b" in #P [the matching subclass is "x"]
the output should be:
l, d
l, e
m, b
The code needs to be efficient as both #F and #P contain millions of rows. I have tried Union, but that works with entire sets of rows and doesn't allow matching by a subset (i.e. within a patentid)
Any help much appreciated.
CodePudding user response:
Is this efficient enough for you?
select f.patentId, p.patentid
from (
select patentId, STRING_AGG(subclass,'') WITHIN GROUP (ORDER BY subclass ASC) as class
from #F
group by patentId
) f
inner join (
select patentId, STRING_AGG(subclass,'') WITHIN GROUP (ORDER BY subclass ASC) as class
from #P
group by patentId
) p on f.class = p.class
CodePudding user response:
This is an example of Relational Division Without Remainder, with multiple divisors.
In other words you want to divide (in a relational sense) #F
by #P
and only take results where there is no remainder.
There are many solutions, here is one
- Partition both tables by
patentid
and calculate a windowed count - Left join one table to the other by
subclass
, filtering against exact matching counts - Group by pairs of IDs
- Filter out anything which doesn't have every
f.subclass
with a matchingp.subclass
, we do this withHAVING COUNT(*) = COUNT(p.subclass)
becauseCOUNT(p.subclass)
only counts non-nulls.
SELECT
f.patentid, p.patentid
FROM (
SELECT *,
cnt = COUNT(*) OVER (PARTITION BY f.patentid)
FROM #F f
) f
LEFT JOIN (
SELECT *,
cnt = COUNT(*) OVER (PARTITION BY p.patentid)
FROM #P p
) p ON p.cnt = f.cnt AND p.subclass = f.subclass
GROUP BY
f.patentid, p.patentid
HAVING COUNT(*) = COUNT(p.subclass);