Home > OS >  SQL recursively creating matching groups based on reference table
SQL recursively creating matching groups based on reference table

Time:06-20

Imagine you had a data source like:

Id Val Data_Date
1 A 2022-01-01
2 B 2022-01-05
3 C 2022-01-09
4 D 2022-01-31
5 E 2022-02-01

With a reference table matching values in this way:

Target_Val Matching_Val Valid_Start Valid_End
B A 2022-01-04 2022-01-06
C B 2022-01-09 2022-01-09
D A 2022-01-31 2022-01-31

Imagine you want to create a table grouping values together where there is a match in the reference table within X days, say 4. And you want to apply this matching recursively.

Output would be something like this:

Group_Id Id
1 1
1 2
1 3
2 4
3 5

The logic here would be that C matches to B in the appropriate date range, and B matches to A in the appropriate date range, therefore they are all one group. But although D matches to A, it is too far apart (greater than 4 days). And E doesn't match to anything.
There could be any depth (A > B > C > D ...)

Is there an appropriate algorithm in SQL to accomplish this? The values of the group IDs are unimportant and just meant to group data points together.

CodePudding user response:

Here's my attempt. You do indeed need a recursive CTE, but you need to join the source table to groups table and then join back to the source table to ensure that the child fits within the parent's 4 day window. E.g. in the case of D and A, as you mention, they match, but they aren't close enough to be counted.

Then I added a calc to work out which rows were valid hierarchies and used that for the recursive join, because we can exclude anything not part of a hierachy.

After that we need to order the records by their depth so we know which parent record is first, e.g. in the case of A > B > C.

Then DENSE_RANK over the results to get your final groups. This will need some testing with deeper levels of recursion though, but this should point you in the right direction:

CREATE TABLE SourceData
(
Id INTEGER,
Val CHAR(1),
Data_Date DATE 
);

CREATE TABLE Groups
(
Target_Val CHAR(1),
Matching_Val CHAR(1),
Valid_Start DATE,
Valid_End DATE
);

INSERT INTO SourceData (Id, Val, Data_Date) VALUES (1,'A','2022-01-01');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (2,'B','2022-01-05');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (3,'C','2022-01-09');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (4,'D','2022-01-31');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (5,'E','2022-02-01');


INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('B','A','2022-01-04','2022-01-06');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('C','B','2022-01-09','2022-01-09');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('D','A','2022-01-31','2022-01-31');

WITH sourceCTE AS
(

SELECT sd.Id, sd.Val, sd.Data_Date, g.Valid_Start, g.Valid_End, IIF(s.Val IS NULL, sd.Val, g.Matching_Val) [ParentVal], CAST(NULL AS DATE) [start], CAST(NULL AS DATE) [end], 1 [Depth],

IIF(s.Val IS NULL, 0, 1) IsHeirarchy
FROM SourceData sd
LEFT JOIN Groups g ON g.Target_Val = sd.Val AND sd.Data_Date BETWEEN g.Valid_Start AND g.Valid_End
LEFT JOIN SourceData s ON s.Val = g.Matching_Val AND ABS(DATEDIFF(DAY, s.Data_Date, sd.Data_Date)) < 5

UNION ALL

 
SELECT s.Id, s.Val, s.Data_Date, g.Valid_Start, g.Valid_End, g.Matching_Val, g.Valid_Start, g.Valid_End, s.[Depth]   1, 1
FROM sourceCTE s
INNER JOIN Groups g ON g.Target_Val = s.[ParentVal] AND s.IsHeirarchy = 1


),
ResultCTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY [Depth] DESC) [RNum]
FROM sourceCTE
)

SELECT DENSE_RANK() OVER (ORDER BY ParentVal) [Group_Id], Id
FROM ResultCTE
WHERE [RNum] = 1

Here's a working fiddle.

I can't promise this is the best solution, because just like the query optimiser I gave up after about 2 hours, ha.

Also, for any future questions, please provide sample data in script format to save time creating the structure.

  • Related