Home > Back-end >  Seperate Overlapping Date Records with Priority
Seperate Overlapping Date Records with Priority

Time:11-30

I have been given some very poorly maintained date tables to sort out, containing information about which position members of staff held at given times. In many cases, the staff are listed with overlapping positions by date (so the data implies they held two positions at one time which should be impossible). Work is being carried out to correct the data, so I can assume that later information is a correction to older information.

I would like to clean the data so as to remove the duplication in any time frame, taking the latest modified record as the "truth" over the period between the start and end dates of the position record it applies for.

I think a diagram will help, so the below shows some possible dirty data, where one employee has been given 6 positions over a period of time. Each line represents the span of a single position (for instance the employee held position #1 between the 10th and the 20th of January). The y axis represents the date modified - position #1 was modified after position #2 and so on.

Dirty Data Diagram

I have made a table to represent the data shown above:

DROP TABLE IF EXISTS #Test;
CREATE TABLE #Test
( Id INT IDENTITY(1,1)
 ,Person    INT
 ,Job       INT
 ,JobStart  DATE
 ,JobEnd    DATE
 ,Modified  DATE
);

INSERT #Test
VALUES
 (1,1,'2020-01-10','2020-01-20',GETDATE())
,(1,2,'2020-01-03','2020-01-10',DATEADD(DAY,-1,GETDATE()))
,(1,3,'2020-01-03','2020-01-13',DATEADD(DAY,-2,GETDATE()))
,(1,4,'2020-01-11','2020-01-20',DATEADD(DAY,-3,GETDATE()))
,(1,5,'2020-01-15','2020-01-25',DATEADD(DAY,-4,GETDATE()))
,(1,6,'2020-01-01','2020-01-30',DATEADD(DAY,-5,GETDATE()))

SELECT * FROM #Test;
DROP TABLE IF EXISTS #Test;

If you imagined that you could look "down" on the first diagram from the top of the y axis, you would see only the most recently modified position for any given time - this is what I would like to return (I don't care about the modified date in the results):

Clean Data Diagram

The output I, as attempted to be demonstrated above, is as follows:

Record Person Job JobStart JobEnd
1 1 6 2020-01-01 2020-01-03
2 1 2 2020-01-03 2020-01-10
3 1 1 2020-01-10 2020-01-20
6 1 5 2020-01-20 2020-01-25
7 1 6 2020-01-25 2020-01-30

I have attempted to modify the answers given in this question and this question, but haven't had any luck applying the modified date priority - they work to a certain extent but always seem to give confusingly incorrect results, and always return one row for each position, whereas I want to potentially create and or discard rows depending on their dates. I even tried using an extensive CASE statement, but that went about as well as you might expect.

I would welcome any suggestions of how to clean the data in this way as I have been at it all morning and spent my lunchbreak making diagrams and still no closer to a solution!

I am working in SQL Server V13.0.4

P.S. The closest I have come is using a Date Tally table that just contains a list of dates:

;WITH AllDays AS 
(   SELECT
         Person
        ,Dated.[Date]
        ,MAX(Modified) AS LastModified
    FROM        #Test       AS Test
    INNER JOIN  ListOfDates AS Dated    ON  Test.JobStart <= Dated.[Date]
                                                    AND Test.JobEnd >= Dated.[Date]
    GROUP BY Person,[Date])
,Thing AS
(
SELECT
     Test.Person
    ,Test.Job
    ,MIN([Date]) OVER (PARTITION BY Test.Job ORDER BY [Date])   AS JobStart
    ,MAX([Date]) OVER (PARTITION BY Test.Job ORDER BY [Date])   AS JobEnd
FROM
            #Test   AS Test
INNER JOIN  AllDays AS Limit    ON  Test.Person = Limit.Person
                                AND Test.Modified = Limit.LastModified
)

The problem in this case is Position #6, which is still listed as only starting on the first... It feels really close, though!

I have several ways of getting to that same result, but the real sticking problem seems to be with Position #6, which needs to appear twice in the results...

;WITH Thing AS
(
SELECT
     ROW_NUMBER() OVER (PARTITION BY T1.Person, [Date] ORDER BY [Modified] DESC) AS MY_Rank
    ,*
FROM 
            #Test       AS T1
CROSS APPLY ListOfDates AS Dated
WHERE
    Dated.[Date] >= (SELECT MIN(JobStart) FROM #Test)
AND Dated.[Date] <= (SELECT MAX(JobEnd) FROM #Test)
AND Dated.[Date] >= T1.JobStart AND Dated.[Date] <= T1.JobEnd
)

SELECT DISTINCT Id, Person, Job, JobStart, JobEnd
FROM     Thing
WHERE
    MY_Rank = 1

CodePudding user response:

Here's an attempt. But it's as far as I'll get with this SQL puzzle.

The idea was to generate a Tally CTE for all dates.
Join the dates to the ranges.
Use row_number to assign the priority job for a date.
Filter priority 1 and generate a ranking.
Group it all up with the ranking.

WITH RCTE_DATES AS (
   SELECT MIN(JobStart) AS JobDate, MAX(JobEnd) AS LastDate
   FROM #Test
   UNION ALL
   SELECT DATEADD(day, 1, JobDate), LastDate
   FROM RCTE_DATES
   WHERE JobDate < LastDate
)
, CTE_RANGES AS
(
   SELECT JobDate, Person, Job, JobStart, JobEnd
   , ROW_NUMBER() OVER (PARTITION BY Person, JobDate ORDER BY Modified DESC, JobEnd) rn
   FROM RCTE_DATES d
   JOIN #Test t ON d.JobDate BETWEEN t.JobStart AND t.JobEnd
)
, CTE_RANKED AS
(
   SELECT JobDate, Person, Job
   , ROW_NUMBER() OVER (PARTITION BY Person, Job ORDER BY JobDate DESC) 
     ROW_NUMBER() OVER (ORDER BY JobDate) AS Rnk
   FROM CTE_RANGES
   WHERE rn = 1
   GROUP BY JobDate, Person, Job
)
SELECT Person, Job
, MIN(JobDate) AS JobStart
, MAX(JobDate) AS JobEnd
FROM CTE_RANKED
GROUP BY Person, Job, Rnk
ORDER BY JobStart
Person | Job | JobStart   | JobEnd    
-----: | --: | :--------- | :---------
     1 |   6 | 2020-01-01 | 2020-01-02
     1 |   2 | 2020-01-03 | 2020-01-09
     1 |   1 | 2020-01-10 | 2020-01-20
     1 |   5 | 2020-01-21 | 2020-01-25
     1 |   6 | 2020-01-26 | 2020-01-30

db<>fiddle here

Side-note: If the lines of the first graph are put on different hights, you can get a different second top view graph. In the sql that can be done by changing the order in CTE_RANGES

  • Related