Creating a new RANK based on delta of previous row-CodePudding

I've been working on an issue for a few days now, and I can't seem to find the right fix. Does anybody have an idea?

Case

We want to create a new a new sequence number whenever an employee has resigned for more than 1 day. We have the delta of the current employment record and the previous, so we can check the sequence. We want to calculate the min(Start) and max(End) of each employment record which isn't separated more than 1 day apart.

Data

Employee	Contract	Unit	Start	End	Delta
John Doe	1	Unit A	2014-01-01	2017-12-31	NULL
John Doe	2	Unit A	2018-02-01	2018-12-31	31
John Doe	3	Unit B	2019-01-01	2020-05-31	1
John Doe	4	Unit A	2020-06-01	NULL	1

With the query it should give back:

Employee	Contract	Unit	Start	End	Delta	Sequence
John Doe	1	Unit A	2014-01-01	2017-12-31	NULL	1
John Doe	2	Unit A	2018-02-01	2018-12-31	31	2
John Doe	3	Unit B	2019-01-01	2020-05-31	1	2
John Doe	4	Unit A	2020-06-01	NULL	1	2

That is because sequence 1 end at 31-12-2017, and a new one starts in February of 2018, so there has been more than 1 day of separation between the records. The following all have a sequence of 2 because it is continuing.

Query

I've tried a few things already with lag() and lead(), but I keep working myself into a corner with the data sample that I have. When I run it on the full set it won't work.

SELECT
    Employee,
    Start,
    End,
    DeltaPrevious,
    Delta,
    DeltaNext,
    case
        when DeltaPrevious IS NULL AND Delta = 1 then 1
        when DeltaPrevious = 1 AND Delta > 1 then min(Contract) OVER (PARTITION BY Employee ORDER BY Contract ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
        when DeltaPrevious > 1 AND Delta = 1 then min(Contract) OVER (PARTITION BY Employee ORDER BY Contract ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    end as Sequence
FROM
    Contracts
ORDER BY
    Employee, Start ASC

Hope that someone has a great idea.

Thanks,

CodePudding user response：

If I understood correctly from the definition of Sequence in your second table, you are more interested in the DeltaNext than in the Delta(Previous). Here an attempt, including the code to create a sample input date with two more employees:

CREATE TABLE #input_table (Employee VARCHAR(255), [Contract] INT, Unit VARCHAR(6), [Start] DATE, [End] DATE)

INSERT INTO #input_table
VALUES
('John Doe',    1,  'Unit A',   '2014-01-01',   '2017-12-31'),
('John Doe',    2,  'Unit A',   '2018-02-01',   '2018-12-31'),
('John Doe',    3,  'Unit B',   '2019-01-01',   '2020-05-31'),
('John Doe',    4,  'Unit A',   '2020-06-01',   NULL),
('Alice',       1,  'Unit A',   '2020-01-01',   NULL),
('Bob',         1,  'Unit C',   '2020-01-01',   '2020-02-20')

First we create the deltas:

SELECT *
    , DeltaPrev = DATEDIFF(DAY, LAG([End], 1, NULL) OVER(PARTITION BY Employee 
ORDER BY [Start]), [Start])  -- Not relevant (?)
    , DeltaNext = DATEDIFF(DAY, [End], LEAD([Start], 1, NULL) OVER(PARTITION BY Employee ORDER BY [Start]))
INTO #cte_delta -- I'll create a CTE at the end
FROM #input_table

Then we define Sequence:

SELECT *
    , [Sequence] = CASE WHEN DeltaNext > 1 THEN 1 ELSE 2 END
INTO #cte_sequence
FROM #cte_delta

We then group the same Sequences by assigning a unique ROW_NUMBER for each employee with consecutive/ same Sequences:

SELECT *
    , GRP = ROW_NUMBER() OVER(PARTITION BY Employee ORDER BY [Start]) - ROW_NUMBER() OVER(PARTITION BY Employee, [Sequence] ORDER BY [Start]) 
INTO #cte_grp
FROM #cte_sequence

Finally we calculate the min and max of the contract duration:

SELECT *
    , MIN([Start]) OVER(PARTITION BY Employee, GRP) AS ContractStart
    , CASE WHEN COUNT(*) OVER(PARTITION BY Employee, GRP) = COUNT([End]) 
OVER(PARTITION BY Employee, GRP) THEN MAX([End]) OVER(PARTITION BY Employee, GRP) ELSE NULL END AS ContractEnd  
FROM cte_grp

The COUNT(*) and COUNT([End]) comparison is necessary or else the ContractEnd would be the max non-NULL value, i.e. 2018-02-01.

The whole code with CTEs here:

WITH cte_delta AS (
    SELECT *
        , DeltaPrev = DATEDIFF(DAY, LAG([End], 1, NULL) OVER(PARTITION BY Employee ORDER BY [Start]), [Start])  -- Not relevant (?)
        , DeltaNext = DATEDIFF(DAY, [End], LEAD([Start], 1, NULL) OVER(PARTITION BY Employee ORDER BY [Start]))
    FROM #input_table
)
, cte_sequence AS (
    SELECT *
        , [Sequence] = CASE WHEN DeltaNext > 1 THEN 1 ELSE 2 END
    FROM cte_delta
)
, cte_grp AS (
SELECT *
    , GRP = ROW_NUMBER() OVER(PARTITION BY Employee ORDER BY [Start]) - ROW_NUMBER() OVER(PARTITION BY Employee, [Sequence] ORDER BY [Start])
FROM cte_sequence
) 
SELECT *
    , MIN([Start]) OVER(PARTITION BY Employee, GRP) AS ContractStart
    , CASE WHEN COUNT(*) OVER(PARTITION BY Employee, GRP) = COUNT([End]) OVER(PARTITION BY Employee, GRP) THEN MAX([End]) OVER(PARTITION BY Employee, GRP) ELSE NULL END AS ContractEnd  
FROM cte_grp

Here the output:

Employee	Contract	Unit	Start	End	DeltaPrev	DeltaNext	Sequence	GRP	ContractStart	ContractEnd
Alice	1	Unit A	2020-01-01	NULL	NULL	NULL	2	0	2020-01-01	NULL
Bob	1	Unit C	2020-01-01	2020-02-20	NULL	NULL	2	0	2020-01-01	2020-02-20
John Doe	1	Unit A	2014-01-01	2017-12-31	NULL	32	1	0	2014-01-01	2017-12-31
John Doe	2	Unit A	2018-02-01	2018-12-31	32	1	2	1	2018-02-01	NULL
John Doe	3	Unit B	2019-01-01	2020-05-31	1	1	2	1	2018-02-01	NULL
John Doe	4	Unit A	2020-06-01	NULL	1	NULL	2	1	2018-02-01	NULL

Feel free to select DISTINCT records according to your needs.

CodePudding user response：

Basically, you want to use lag() to get the previous date and then do a cumulative sum. This looks like:

select c.*,
       sum(case when prev_end >= dateadd(day, -1, start) then 0 else 1
           end) over (partition by employee order by start) as ranking
from (select c.*,
             lag(end) over (partition by employee order by start) as prev_end
      from contracts c
     ) c;

You mention that you might want to recalculate the new start and end. You would just use the above as a subquery/CTE and aggregate on employee and ranking.