How to get this query logic (instead of using Checksum)?-CodePudding

I have been struggling to get the right data using Checksum for last 15 days, and now I am trying to find other way.

I am trying to get any data output that has been changed from Previous day's file to Today's file on Punch Card's punch_start HOUR due to unexpected Time Zone hour change (not minute).

Please see the bottom sample of data.

Dataset1 (Yesterday's file):

 chcecksum   person_id applied_date punch_start          punch_end            punch_hours
-1552866149  650067    2022-09-04   2022-09-04T20:11:00Z 2022-09-04T22:52:00Z 2.68333333333333
-1367087212  650067    2022-09-04   2022-09-04T22:52:00Z 2022-09-04T23:26:00Z 0.566666666666667

Dataset2 (Today's file):

chcecksum   person_id applied_date  punch_start           punch_end             punch_hours
-1564056421 650067    2022-09-04    2022-09-04T20:11:00Z  2022-09-04T22:52:00Z  2.683333333
-1470176798 650067    2022-09-04    2022-09-04T20:52:00Z  2022-09-04T23:26:00Z  0.566666667

So, what I am trying to is if there is any change of HOUR (in this example) on punch_start only, it will notify (or select those ones).

In this case, there was change from 22:52:00Z to 20:52:00Z on the second entry.

Checksum would not work because if there is any change like 2.683333333 to 2.68333 (without change of punch_start), it will still create different checksum value.

The challenge is finding unique ID for those corresponding entries of two datasets, and it has been a struggle for me.

I have been using something like bottom to create an unique ID for each entry:

,concat(
        [person_id],
        [applied_date] ,
        [punch_hours], 
        datepart(minute,  convert(datetime, cast([punch_start] as datetime), 112))

But, it sill gives me a lot of duplicates because if somebody works from

9:00 AM -- 12:00 PM &

1:00 PM -- 5:00 PM on the same day,

it would create duplicates because they work on the same [applied_date] and same [punch_hours] and same [min].

How do we tackle this?

CodePudding user response：

You can use FULL OUTER JOIN to identified rows that exists in one table but not in the other

select *
from   Dataset1 d1
       full outer join Dataset2 d2 on  d1.person_id    = d2.person_id
                                   and d1.applied_date = d2.applied_date
                                   and d1.punch_start  = d2.punch_start

CodePudding user response：

Have you looked at using EXCEPT?

-- Prep data
select * 
INTO #yesterday
from (values 
(-1552866149  ,650067   , '2022-09-04',  cast('2022-09-04T20:11:00Z' as datetime), cast('2022-09-04T22:52:00Z' as datetime)   , 2.68333333333333 ),
(-1367087212  ,650067   , '2022-09-04',  cast('2022-09-04T22:52:00Z' as datetime), cast('2022-09-04T23:26:00Z' as datetime)   , 0.566666666666667)
)t1(chcecksum   ,person_id ,applied_date  ,punch_start             ,punch_end               ,punch_hours)


select * 
INTO #today
from (values 
(-1564056421 , 650067    ,'2022-09-04', cast('2022-09-04T20:11:00Z' as datetime), cast('2022-09-04T22:52:00Z' as datetime),  2.683333333),
(-1470176798 , 650067    ,'2022-09-04', cast('2022-09-04T20:52:00Z' as datetime), cast('2022-09-04T23:26:00Z' as datetime),  0.566666667)
)t2(chcecksum   ,person_id ,applied_date  ,punch_start             ,punch_end               ,punch_hours)


-- output    
select 
    person_id, 
    applied_date, 
    punch_end, 
    Round(punch_hours, 4) as punch_hours,                           -- hope this is acceptable
    datepart(HH, punch_start) as punch_start_hour,                  -- only looking for changes to HOUR
    format(punch_start, 'yyyy-MM-dd XX:mm') as punch_start_hourless -- mask the the hour with XX so the rest of the Datetime can still be compared
from #yesterday
except 
select 
    person_id, 
    applied_date, 
    punch_end, 
    Round(punch_hours, 4) as punch_hours, 
    datepart(HH, punch_start) as punch_start_hour,  
    format(punch_start, 'yyyy-MM-dd XX:mm') as punch_start_hourless
from #today

Wrap the 'output' query in this if you want to get the original values (minus the checksum )

SELECT 
    person_id   
    ,applied_date   
    ,Cast(REPLACE(punch_start_hourless, 'XX', punch_start_hour) as Datetime) as punch_start
    ,punch_end  
    ,punch_hours    
FROM (
-- insert query from above
) sub