I have been struggling to get the right data using Checksum for last 15 days, and now I am trying to find other way.
I am trying to get any data output that has been changed from Previous day's file to Today's file on Punch Card's punch_start HOUR due to unexpected Time Zone hour change (not minute).
Please see the bottom sample of data.
Dataset1 (Yesterday's file):
chcecksum person_id applied_date punch_start punch_end punch_hours
-1552866149 650067 2022-09-04 2022-09-04T20:11:00Z 2022-09-04T22:52:00Z 2.68333333333333
-1367087212 650067 2022-09-04 2022-09-04T22:52:00Z 2022-09-04T23:26:00Z 0.566666666666667
Dataset2 (Today's file):
chcecksum person_id applied_date punch_start punch_end punch_hours
-1564056421 650067 2022-09-04 2022-09-04T20:11:00Z 2022-09-04T22:52:00Z 2.683333333
-1470176798 650067 2022-09-04 2022-09-04T20:52:00Z 2022-09-04T23:26:00Z 0.566666667
So, what I am trying to is if there is any change of HOUR (in this example) on punch_start only, it will notify (or select those ones).
In this case, there was change from 22:52:00Z to 20:52:00Z on the second entry.
Checksum would not work because if there is any change like 2.683333333 to 2.68333 (without change of punch_start), it will still create different checksum value.
The challenge is finding unique ID for those corresponding entries of two datasets, and it has been a struggle for me.
I have been using something like bottom to create an unique ID for each entry:
,concat(
[person_id],
[applied_date] ,
[punch_hours],
datepart(minute, convert(datetime, cast([punch_start] as datetime), 112))
But, it sill gives me a lot of duplicates because if somebody works from
9:00 AM -- 12:00 PM &
1:00 PM -- 5:00 PM on the same day,
it would create duplicates because they work on the same [applied_date] and same [punch_hours] and same [min].
How do we tackle this?
CodePudding user response:
You can use FULL OUTER JOIN
to identified rows that exists in one table but not in the other
select *
from Dataset1 d1
full outer join Dataset2 d2 on d1.person_id = d2.person_id
and d1.applied_date = d2.applied_date
and d1.punch_start = d2.punch_start
CodePudding user response:
Have you looked at using EXCEPT?
-- Prep data
select *
INTO #yesterday
from (values
(-1552866149 ,650067 , '2022-09-04', cast('2022-09-04T20:11:00Z' as datetime), cast('2022-09-04T22:52:00Z' as datetime) , 2.68333333333333 ),
(-1367087212 ,650067 , '2022-09-04', cast('2022-09-04T22:52:00Z' as datetime), cast('2022-09-04T23:26:00Z' as datetime) , 0.566666666666667)
)t1(chcecksum ,person_id ,applied_date ,punch_start ,punch_end ,punch_hours)
select *
INTO #today
from (values
(-1564056421 , 650067 ,'2022-09-04', cast('2022-09-04T20:11:00Z' as datetime), cast('2022-09-04T22:52:00Z' as datetime), 2.683333333),
(-1470176798 , 650067 ,'2022-09-04', cast('2022-09-04T20:52:00Z' as datetime), cast('2022-09-04T23:26:00Z' as datetime), 0.566666667)
)t2(chcecksum ,person_id ,applied_date ,punch_start ,punch_end ,punch_hours)
-- output
select
person_id,
applied_date,
punch_end,
Round(punch_hours, 4) as punch_hours, -- hope this is acceptable
datepart(HH, punch_start) as punch_start_hour, -- only looking for changes to HOUR
format(punch_start, 'yyyy-MM-dd XX:mm') as punch_start_hourless -- mask the the hour with XX so the rest of the Datetime can still be compared
from #yesterday
except
select
person_id,
applied_date,
punch_end,
Round(punch_hours, 4) as punch_hours,
datepart(HH, punch_start) as punch_start_hour,
format(punch_start, 'yyyy-MM-dd XX:mm') as punch_start_hourless
from #today
Wrap the 'output' query in this if you want to get the original values (minus the checksum )
SELECT
person_id
,applied_date
,Cast(REPLACE(punch_start_hourless, 'XX', punch_start_hour) as Datetime) as punch_start
,punch_end
,punch_hours
FROM (
-- insert query from above
) sub