Imagine you had a data source like:
Id | Val | Data_Date |
---|---|---|
1 | A | 2022-01-01 |
2 | B | 2022-01-05 |
3 | C | 2022-01-09 |
4 | D | 2022-01-31 |
5 | E | 2022-02-01 |
With a reference table matching values in this way:
Target_Val | Matching_Val | Valid_Start | Valid_End |
---|---|---|---|
B | A | 2022-01-04 | 2022-01-06 |
C | B | 2022-01-09 | 2022-01-09 |
D | A | 2022-01-31 | 2022-01-31 |
Imagine you want to create a table grouping values together where there is a match in the reference table within X days, say 4. And you want to apply this matching recursively.
Output would be something like this:
Group_Id | Id |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
2 | 4 |
3 | 5 |
The logic here would be that C matches to B in the appropriate date range, and B matches to A in the appropriate date range, therefore they are all one group.
But although D matches to A, it is too far apart (greater than 4 days). And E doesn't match to anything.
There could be any depth (A > B > C > D ...)
Is there an appropriate algorithm in SQL to accomplish this? The values of the group IDs are unimportant and just meant to group data points together.
CodePudding user response:
Here's my attempt. You do indeed need a recursive CTE, but you need to join the source table to groups table and then join back to the source table to ensure that the child fits within the parent's 4 day window. E.g. in the case of D and A, as you mention, they match, but they aren't close enough to be counted.
Then I added a calc to work out which rows were valid hierarchies and used that for the recursive join, because we can exclude anything not part of a hierachy.
After that we need to order the records by their depth so we know which parent record is first, e.g. in the case of A > B > C.
Then DENSE_RANK over the results to get your final groups. This will need some testing with deeper levels of recursion though, but this should point you in the right direction:
CREATE TABLE SourceData
(
Id INTEGER,
Val CHAR(1),
Data_Date DATE
);
CREATE TABLE Groups
(
Target_Val CHAR(1),
Matching_Val CHAR(1),
Valid_Start DATE,
Valid_End DATE
);
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (1,'A','2022-01-01');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (2,'B','2022-01-05');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (3,'C','2022-01-09');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (4,'D','2022-01-31');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (5,'E','2022-02-01');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('B','A','2022-01-04','2022-01-06');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('C','B','2022-01-09','2022-01-09');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('D','A','2022-01-31','2022-01-31');
WITH sourceCTE AS
(
SELECT sd.Id, sd.Val, sd.Data_Date, g.Valid_Start, g.Valid_End, IIF(s.Val IS NULL, sd.Val, g.Matching_Val) [ParentVal], CAST(NULL AS DATE) [start], CAST(NULL AS DATE) [end], 1 [Depth],
IIF(s.Val IS NULL, 0, 1) IsHeirarchy
FROM SourceData sd
LEFT JOIN Groups g ON g.Target_Val = sd.Val AND sd.Data_Date BETWEEN g.Valid_Start AND g.Valid_End
LEFT JOIN SourceData s ON s.Val = g.Matching_Val AND ABS(DATEDIFF(DAY, s.Data_Date, sd.Data_Date)) < 5
UNION ALL
SELECT s.Id, s.Val, s.Data_Date, g.Valid_Start, g.Valid_End, g.Matching_Val, g.Valid_Start, g.Valid_End, s.[Depth] 1, 1
FROM sourceCTE s
INNER JOIN Groups g ON g.Target_Val = s.[ParentVal] AND s.IsHeirarchy = 1
),
ResultCTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY [Depth] DESC) [RNum]
FROM sourceCTE
)
SELECT DENSE_RANK() OVER (ORDER BY ParentVal) [Group_Id], Id
FROM ResultCTE
WHERE [RNum] = 1
Here's a working fiddle.
I can't promise this is the best solution, because just like the query optimiser I gave up after about 2 hours, ha.
Also, for any future questions, please provide sample data in script format to save time creating the structure.