SQL recursively creating matching groups based on reference table-CodePudding

Imagine you had a data source like:

Id	Val	Data_Date
1	A	2022-01-01
2	B	2022-01-05
3	C	2022-01-09
4	D	2022-01-31
5	E	2022-02-01

With a reference table matching values in this way:

Target_Val	Matching_Val	Valid_Start	Valid_End
B	A	2022-01-04	2022-01-06
C	B	2022-01-09	2022-01-09
D	A	2022-01-31	2022-01-31

Imagine you want to create a table grouping values together where there is a match in the reference table within X days, say 4. And you want to apply this matching recursively.

Output would be something like this:

Group_Id	Id
1	1
1	2
1	3
2	4
3	5

The logic here would be that C matches to B in the appropriate date range, and B matches to A in the appropriate date range, therefore they are all one group. But although D matches to A, it is too far apart (greater than 4 days). And E doesn't match to anything.
There could be any depth (A > B > C > D ...)

Is there an appropriate algorithm in SQL to accomplish this? The values of the group IDs are unimportant and just meant to group data points together.

CodePudding user response：

Here's my attempt. You do indeed need a recursive CTE, but you need to join the source table to groups table and then join back to the source table to ensure that the child fits within the parent's 4 day window. E.g. in the case of D and A, as you mention, they match, but they aren't close enough to be counted.

Then I added a calc to work out which rows were valid hierarchies and used that for the recursive join, because we can exclude anything not part of a hierachy.

After that we need to order the records by their depth so we know which parent record is first, e.g. in the case of A > B > C.

Then DENSE_RANK over the results to get your final groups. This will need some testing with deeper levels of recursion though, but this should point you in the right direction:

CREATE TABLE SourceData
(
Id INTEGER,
Val CHAR(1),
Data_Date DATE 
);

CREATE TABLE Groups
(
Target_Val CHAR(1),
Matching_Val CHAR(1),
Valid_Start DATE,
Valid_End DATE
);

INSERT INTO SourceData (Id, Val, Data_Date) VALUES (1,'A','2022-01-01');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (2,'B','2022-01-05');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (3,'C','2022-01-09');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (4,'D','2022-01-31');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (5,'E','2022-02-01');


INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('B','A','2022-01-04','2022-01-06');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('C','B','2022-01-09','2022-01-09');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('D','A','2022-01-31','2022-01-31');

WITH sourceCTE AS
(

SELECT sd.Id, sd.Val, sd.Data_Date, g.Valid_Start, g.Valid_End, IIF(s.Val IS NULL, sd.Val, g.Matching_Val) [ParentVal], CAST(NULL AS DATE) [start], CAST(NULL AS DATE) [end], 1 [Depth],

IIF(s.Val IS NULL, 0, 1) IsHeirarchy
FROM SourceData sd
LEFT JOIN Groups g ON g.Target_Val = sd.Val AND sd.Data_Date BETWEEN g.Valid_Start AND g.Valid_End
LEFT JOIN SourceData s ON s.Val = g.Matching_Val AND ABS(DATEDIFF(DAY, s.Data_Date, sd.Data_Date)) < 5

UNION ALL

 
SELECT s.Id, s.Val, s.Data_Date, g.Valid_Start, g.Valid_End, g.Matching_Val, g.Valid_Start, g.Valid_End, s.[Depth]   1, 1
FROM sourceCTE s
INNER JOIN Groups g ON g.Target_Val = s.[ParentVal] AND s.IsHeirarchy = 1


),
ResultCTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY [Depth] DESC) [RNum]
FROM sourceCTE
)

SELECT DENSE_RANK() OVER (ORDER BY ParentVal) [Group_Id], Id
FROM ResultCTE
WHERE [RNum] = 1

Here's a working fiddle.

I can't promise this is the best solution, because just like the query optimiser I gave up after about 2 hours, ha.

Also, for any future questions, please provide sample data in script format to save time creating the structure.