I have a query to return sample values for each employee per calendar year, and a column that checks (yes/no) if the sample value is >= 60,000.
My initial data:
Employee_ID Calendar_Year Sample_Value Sample_Check
1234 2020 55,000 No
1234 2021 70,000 Yes
1234 2022 50,000 No
3456 2020 80,000 Yes
3456 2021 40,000 No
3456 2022 65,000 Yes
5678 2020 30,000 No
5678 2021 70,000 Yes
5678 2022 90,000 Yes
I would like to get this result, because this employee is the only one with "yes" for 2 consecutive calendar years.
Employee_ID Calendar_Year Sample_Value Sample_Check
5678 2022 90,000 Yes
I have looked up similar questions but could not find something that solves my issue. I have also looked into LAG and LEAD but need help in understanding if they can give me the result I want.
CodePudding user response:
I'm not sure how bullet proof this is. I used the lag function in a window partition to get the prior Sample_Check. I then matched on the outer query to get the record that (basically shows yes = yes). If you had 3 (Yes) in a row then it would pull back 2. You might be able to use some conditional logic to offset the rows if you ran into that scenario
SELECT
*
FROM
(
SELECT Employee_ID
,Calendar_Year
,Sample_Value
,Sample_Check
, LAG(Sample_Check) OVER (PARTITION BY Employee_ID ORDER BY Employee_ID ASC, Calendar_Year ASC) AS LagSampleCheck1
FROM EMPLOYEETABLE
) X
WHERE Sample_Check = LagSampleCheck1
ORDER BY Employee_ID ASC, Calendar_Year ASC
I also created this one that does another row_number() Over (Partition BY Employee ID and Order by Calendar year so it picks up the latest year if you have a situation where you have more than one that meet that criteria. I added another record to your original data set (Employee ID 5678, Calendar Year 2023, Samples Value and Sample Check Yes) too create two records.
Employee_ID Calendar_Year Sample_Value Sample_Check
1234 2020 55,000 No
1234 2021 70,000 Yes
1234 2022 50,000 No
3456 2020 80,000 Yes
3456 2021 40,000 No
3456 2022 65,000 Yes
5678 2020 30,000 No
5678 2021 70,000 Yes
5678 2022 90,000 Yes
5678 2023 90,000 Yes
SELECT
*
FROM
(
SELECT
*
, ROW_NUMBER() OVER (PARTITION BY EMPLOYEE_ID ORDER BY CALENDAR_YEAR DESC) AS ROWCOUNTER
FROM
(
SELECT Employee_ID
,Calendar_Year
,Sample_Value
,Sample_Check
, LAG(Sample_Check) OVER (PARTITION BY Employee_ID ORDER BY Employee_ID ASC, Calendar_Year ASC) AS LagSampleCheck1
FROM EMPLOYEETABLE
) X
WHERE Sample_Check = LagSampleCheck1
) Z
WHERE ROWCOUNTER = 1
ORDER BY Employee_ID ASC, Calendar_Year ASC
CodePudding user response:
I would tend towards using a correlated query to find qualifying rows, followed by a row_number window to select the greatest/least of each group you require:
with v as (
select *,
case when exists (
select * from t t2
where t2.Employee_ID = t.Employee_ID
and t.Sample_Check = 'Yes'
and t2.Sample_Check = 'Yes'
and t2.Calendar_Year = t.Calendar_Year - 1
) then 1 else 0 end valid
from t
), s as (
select *,
Row_Number() over(partition by Employee_ID, valid order by Calendar_Year desc) rn
from v
)
select Employee_Id, Calendar_Year, Sample_Value, Sample_Check
from s
where valid = 1 and rn = 1;
CodePudding user response:
This is the most straightforward solution . Just join the table with itself ( assuming calendar year is numeric )
SELECT t1.*, t2.sample_check
FROM data AS t1, data AS t2
WHERE t1.emp_id = t2.emp_id
AND t1.calendar_year = t2.calendar_year 1
AND t1.sample_check = t2.sample_check
AND t1.sample_check = 'Yes'
Also you can get same result with lag function with this;
WITH temp AS (SELECT emp_id
, calendar_year
, sample_value
, sample_check
, lag( CASE WHEN sample_check = 'Yes' THEN 1 ELSE 0 END, 1 )
OVER (PARTITION BY emp_id ORDER BY calendar_year) AS prevcheck
FROM data)
SELECT *
FROM temp
WHERE prevcheck = 1
AND sample_check = 'Yes'
Both gives the same result
emp_id calendar_year sample_value sample_check prevcheck
5678 2022 90 Yes 1