How do I compare whether the same/intersecting data exists between one row and the other rows of the same table?


I have a table with the sample data below. I want to compare each record with all the other records in the same table, and return the ID of every record it collides with. The Name column holds comma-separated data, so if one record has 'A,C' as its Name and another has 'A' (see the input below), they collide because 'A' is common to both. One of the records has a NULL Name; a NULL should collide with all the remaining records. Besides Name, I have around 10 such columns to verify.

Input

ID Name
1 A,C
2 B
3 A
4 NULL

OUTPUT

ID ColloidID
1 3
1 4
2 4
3 1
3 4
4 1
4 2
4 3

Problem: I have implemented the solution below, and it works as expected. It is fine with little data in the table (<100k rows), but it takes far more time and space with millions of rows (e.g. >20M):
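For reproducibility, #Temp1 with the sample data above can be set up like this (a minimal sketch):

CREATE TABLE #Temp1 (ID int NOT NULL, Name varchar(30) NULL);

INSERT INTO #Temp1 (ID, Name)
VALUES (1, 'A,C'), (2, 'B'), (3, 'A'), (4, NULL);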

 SELECT DISTINCT A.ID,B.ID AS ColloidID
 FROM #Temp1 A
 CROSS APPLY #Temp1 B
 WHERE A.ID<>B.ID
 AND master.dbo.fIntersection(COALESCE(A.Name,B.Name,''),COALESCE(B.Name,A.Name,'')) = 1
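The body of master.dbo.fIntersection is not shown in the question. For context, a function with that contract might look roughly like the sketch below (an assumption, not the original; it returns 1 when the two comma-separated lists share at least one value):

CREATE FUNCTION dbo.fIntersection (@a varchar(max), @b varchar(max))
RETURNS bit
AS
BEGIN
    -- Hypothetical reconstruction: the real function is not shown.
    IF EXISTS (SELECT 1
               FROM STRING_SPLIT(@a, ',') s1
               JOIN STRING_SPLIT(@b, ',') s2 ON s2.value = s1.value)
        RETURN 1;
    RETURN 0;
END;

A scalar UDF like this is evaluated once per row pair, which is part of why the CROSS APPLY approach degrades so badly at 20M rows.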

CodePudding user response:

Ideally you should not store multiple pieces of info in a single column.

Be that as it may, you can use a nested EXISTS with STRING_SPLIT to compare the two columns.

SELECT t1.ID, t2.ID AS ColloidID
FROM #Temp1 t1
JOIN #Temp1 t2 ON t2.ID <> t1.ID
  AND (t1.Name IS NULL OR t2.Name IS NULL   -- NULL collides with everything
    OR EXISTS (SELECT 1                     -- the two lists share a value
      FROM STRING_SPLIT(t1.Name, ',') s1
      JOIN STRING_SPLIT(t2.Name, ',') s2 ON s2.value = s1.value
    )
  )
ORDER BY
  t1.ID,
  t2.ID;

db<>fiddle
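Note that the question mentions around 10 such comma-separated columns; the same IS NULL / EXISTS pattern would be repeated per column, AND-ed or OR-ed into the join depending on whether a collision must hold on every column or on any one of them (the question doesn't say which).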

CodePudding user response:

20M rows isn't a lot of data, provided a good database design is used, with proper indexes. This is definitely not a good design: it violates the most basic design rule, one value per field. As a result, Name can't be indexed usefully, and a self-join of 2*10^7 rows forces (2*10^7)^2 = 4*10^14 comparisons.

The only way to get acceptable performance is to fix the design. To do that, Name has to be split into separate rows and stored in a table whose Name column is covered by an index or primary key:

create table #Id_Names (
    ID bigint not null,
    Name varchar(30)  null,
    INDEX IX_Id_Names (Name,ID)
);
GO
INSERT INTO #Id_Names (ID, Name)
SELECT t.ID, s.value
FROM #Temp1 t
-- OUTER APPLY rather than CROSS APPLY: STRING_SPLIT(NULL, ',') returns an
-- empty set, so CROSS APPLY would silently drop the NULL-Name rows that the
-- NULL-as-wildcard logic below relies on.
OUTER APPLY STRING_SPLIT(t.Name, ',') s;

After that, the query simplifies to:

SELECT DISTINCT    -- IDs sharing several values would otherwise pair up more than once
    t1.ID, t2.ID AS ColloidID
FROM #Id_Names t1
INNER JOIN #Id_Names t2
    ON t1.ID <> t2.ID
    AND (t1.Name = t2.Name
         OR t1.Name IS NULL
         OR t2.Name IS NULL)
This can run a lot faster. The only real problem is the logic of treating NULL as a wildcard: a NULL Name matches every other row, so each NULL record adds roughly 2*(20M-1) rows to the result, once on each side of the join. The same relation is also reported twice, e.g. (1,4) and (4,1).
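If each pair were only needed once, a common refinement (a sketch, not part of the original answer) is to join on t1.ID < t2.ID, though note that the expected output in the question does list both directions:

SELECT DISTINCT t1.ID, t2.ID AS ColloidID
FROM #Id_Names t1
INNER JOIN #Id_Names t2
    ON t1.ID < t2.ID              -- emits (1,4) but not (4,1)
    AND (t1.Name = t2.Name
         OR t1.Name IS NULL
         OR t2.Name IS NULL);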

If #Temp1 were a permanent table, an alternative would be to create an indexed view. Creating a unique clustered index over a view essentially generates, stores, and updates its results automatically.
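For illustration only, with hypothetical object names, the general pattern is below. It assumes the split data already lives in a permanent table dbo.Id_Names, because an indexed view cannot reference STRING_SPLIT or any other table-valued function; this particular view just materializes a row count per Name value:

CREATE VIEW dbo.vNameCounts
WITH SCHEMABINDING                    -- required for an indexed view
AS
SELECT Name, COUNT_BIG(*) AS RowCnt   -- COUNT_BIG(*) is mandatory with GROUP BY
FROM dbo.Id_Names
WHERE Name IS NOT NULL
GROUP BY Name;
GO
-- The unique clustered index is what stores and auto-maintains the results.
CREATE UNIQUE CLUSTERED INDEX UX_vNameCounts ON dbo.vNameCounts (Name);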

Another option is to create a clustered columnstore index. This provides both compression and acceleration: the data is stored per column, in rowgroups of roughly 1M rows, and within each rowgroup a given column value is stored only once.

create table #Id_Names (
    ID bigint not null,
    Name varchar(30) null,
    INDEX CCI_Id_Names CLUSTERED COLUMNSTORE
);
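The OUTER APPLY population and the self-join query shown above work unchanged against this definition; only the storage layout differs.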