Home > database >  Trying to JOIN table of transactional data (and aggregate on the fly) using between dates
Trying to JOIN table of transactional data (and aggregate on the fly) using between dates

Time:12-11

I'm trying to write a query to check an import table of aggregated data against the source system (transactional data). I feel like it should be simple, but I can't seem to get it figured.

The aggregate table (let's call it DW) looks like this:

NameID StartTime EndTime Material Total Count
19Test 2022-12-04 23:15:00.000 2022-12-05 01:57:00.000 TAM 26 4
19Test 2022-12-04 23:15:00.000 2022-12-05 01:57:00.000 BVO 32 5
20Test 2022-12-04 20:15:00.000 2022-12-04 23:11:00.000 TAM 54 3
20Test 2022-12-04 20:15:00.000 2022-12-04 23:11:00.000 BVO 106 6

Total is the sum of Amount from the SRC table (below) and Count is the count of records that match the NameID and Material and occur between StartTime and EndTime.

I want to check against the source table to make sure the pipeline that picks up the information aggregated everything correctly.

Source Table (let's call it SRC) looks like this:

NameID TimeStamp Material Amount
19Test 2022-12-04 22:15:56.000 TAM 3
19Test 2022-12-04 23:17:02.000 TAM 5
19Test 2022-12-04 23:18:06.000 TAM 7
19Test 2022-12-04 23:19:14.000 TAM 11
19Test 2022-12-04 23:20:19.000 BVO 1
19Test 2022-12-04 23:21:19.000 BVO 6
19Test 2022-12-04 23:22:33.000 BVO 9
19Test 2022-12-04 23:23:36.000 CAD 2
19Test 2022-12-04 23:24:53.000 BVO 6
19Test 2022-12-04 23:25:56.000 TAM 3
19Test 2022-12-04 23:27:23.000 CAD 8
19Test 2022-12-04 23:28:25.000 BVO 10
20Test 2022-12-04 18:15:56.000 TAM 11
20Test 2022-12-04 20:17:02.000 TAM 20
20Test 2022-12-04 20:18:06.000 TAM 33
20Test 2022-12-04 20:19:14.000 TAM 1
20Test 2022-12-04 20:20:19.000 BVO 7
20Test 2022-12-04 20:21:19.000 BVO 22
20Test 2022-12-04 20:22:33.000 BVO 31
20Test 2022-12-04 20:23:36.000 CAD 7
20Test 2022-12-04 20:24:53.000 BVO 11
20Test 2022-12-04 20:25:56.000 BVO 21
20Test 2022-12-04 20:27:23.000 CAD 29
20Test 2022-12-04 20:28:25.000 BVO 14

*I included 2 timestamped records that shouldn't match and also 4 material records that don't match

Ideally, I'm looking for a result set that would be an extension of the DW table by having columns represent the totals from the SRC table:

NameID StartTime EndTime Material Total Count Total - SRC Count - SRC
19Test 2022-12-04 23:15:00.000 2022-12-05 01:57:00.000 TAM 26 4 26 4
19Test 2022-12-04 23:15:00.000 2022-12-05 01:57:00.000 BVO 32 5 32 5
20Test 2022-12-04 20:15:00.000 2022-12-04 23:11:00.000 TAM 54 3 54 3
20Test 2022-12-04 20:15:00.000 2022-12-04 23:11:00.000 BVO 106 6 106 6

I've tried something like below:

SELECT dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count, src.Total, src.Count
FROM dw
INNER JOIN (
    SELECT SUM(src.Amount) Total, count(*) count, src.NameID, src.Material
    FROM src
    WHERE src.NameID = dw.NameID
    AND src.SampleDateTime BETWEEN dw.StartTime AND dw.EndTime
    GROUP BY src.NameID, d.Material
) as d on dw.NameID = d.NameID

But I get the following error:

Msg 4104, Level 16, State 1, Line 80
The multi-part identifier "dw.Machine" could not be bound.

I've thought about using a cursor and temp tables to loop through and aggregate the data from the SRC table based on records from the DW table, but I think there's a more efficient (and easier) way to do this.

CodePudding user response:

I think you're definitely on the right path, you've just overcomplicated it imo.

Instead of the subquery, just leave it as a normal JOIN. Then with a few minor updates to grouping and some column name & join fixes:

SELECT dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count, 
       SUM(src.Amount) AS Total_SRC, count(*) AS Count_SRC
FROM   dw
       INNER JOIN src ON dw.NameID = src.NameID
                   AND dw.Material = src.Material 
                   AND src.SampleDateTime BETWEEN dw.StartTime AND dw.EndTime
GROUP BY dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count

Result are as below, and can be seen in this db<>fiddle

NameID  StartTime                EndTime                  Material  Total  Count  Total_SRC  Count_SRC
19Test  2022-12-04 23:15:00.000  2022-12-05 01:57:00.000  BVO       32     5      32         5
19Test  2022-12-04 23:15:00.000  2022-12-05 01:57:00.000  TAM       26     4      26         4
20Test  2022-12-04 20:15:00.000  2022-12-04 23:11:00.000  BVO       106    6      106        6
20Test  2022-12-04 20:15:00.000  2022-12-04 23:11:00.000  TAM       54     3      54         3

Note - if you need to run this frequently and/or want quick results, I'd look at the indexing on the src table. Indeed, I would probably either

  • Put the clustered index on Name_Id, Material, SampleDateTime (in order)
  • Put a non-clustered index on those fields, and INCLUDE Amount (though this basically creates a copy of the whole table)

CodePudding user response:

Just another way of thinking, since you want to check the imported table of aggregated data against the source system table, you could use the EXCEPT operator as the following:

The SQL EXCEPT clause/operator is used to combine two SELECT statements and returns rows from the first SELECT statement that are not returned by the second SELECT statement. This means EXCEPT returns only rows, which are not available in the second SELECT statement.

Select NameID,
       StartTime,
       EndTime,
       Material,
       Total,
       [Count]
From DW
Except
Select S.NameID, 
       D.StartTime,
       D.EndTime,
       S.Material,
       Sum(S.Amount) As Total,
       Count(*) As [Count]
From SRC S Join DW D
On S.NameID = D.NameID And S.Material  = D.Material
   And S.TimeStamp Between D.StartTime And D.EndTime
Group By S.NameID, S.Material, D.StartTime, D.EndTime

So no rows will be returned by this query when all of imported table rows are found in the aggregated data from the source table, which means that the imported table is aggregated correctly.

See a demo.

  • Related