Trying to JOIN table of transactional data (and aggregate on the fly) using between dates-CodePudding

I'm trying to write a query to check an import table of aggregated data against the source system (transactional data). I feel like it should be simple, but I can't seem to get it figured.

The aggregate table (let's call it DW) looks like this:

NameID	StartTime	EndTime	Material	Total	Count
19Test	2022-12-04 23:15:00.000	2022-12-05 01:57:00.000	TAM	26	4
19Test	2022-12-04 23:15:00.000	2022-12-05 01:57:00.000	BVO	32	5
20Test	2022-12-04 20:15:00.000	2022-12-04 23:11:00.000	TAM	54	3
20Test	2022-12-04 20:15:00.000	2022-12-04 23:11:00.000	BVO	106	6

Total is the sum of Amount from the SRC table (below) and Count is the count of records that match the NameID and Material and occur between StartTime and EndTime.

I want to check against the source table to make sure the pipeline that picks up the information aggregated everything correctly.

Source Table (let's call it SRC) looks like this:

NameID	TimeStamp	Material	Amount
19Test	2022-12-04 22:15:56.000	TAM	3
19Test	2022-12-04 23:17:02.000	TAM	5
19Test	2022-12-04 23:18:06.000	TAM	7
19Test	2022-12-04 23:19:14.000	TAM	11
19Test	2022-12-04 23:20:19.000	BVO	1
19Test	2022-12-04 23:21:19.000	BVO	6
19Test	2022-12-04 23:22:33.000	BVO	9
19Test	2022-12-04 23:23:36.000	CAD	2
19Test	2022-12-04 23:24:53.000	BVO	6
19Test	2022-12-04 23:25:56.000	TAM	3
19Test	2022-12-04 23:27:23.000	CAD	8
19Test	2022-12-04 23:28:25.000	BVO	10
20Test	2022-12-04 18:15:56.000	TAM	11
20Test	2022-12-04 20:17:02.000	TAM	20
20Test	2022-12-04 20:18:06.000	TAM	33
20Test	2022-12-04 20:19:14.000	TAM	1
20Test	2022-12-04 20:20:19.000	BVO	7
20Test	2022-12-04 20:21:19.000	BVO	22
20Test	2022-12-04 20:22:33.000	BVO	31
20Test	2022-12-04 20:23:36.000	CAD	7
20Test	2022-12-04 20:24:53.000	BVO	11
20Test	2022-12-04 20:25:56.000	BVO	21
20Test	2022-12-04 20:27:23.000	CAD	29
20Test	2022-12-04 20:28:25.000	BVO	14

*I included 2 timestamped records that shouldn't match and also 4 material records that don't match

Ideally, I'm looking for a result set that would be an extension of the DW table by having columns represent the totals from the SRC table:

NameID	StartTime	EndTime	Material	Total	Count	Total - SRC	Count - SRC
19Test	2022-12-04 23:15:00.000	2022-12-05 01:57:00.000	TAM	26	4	26	4
19Test	2022-12-04 23:15:00.000	2022-12-05 01:57:00.000	BVO	32	5	32	5
20Test	2022-12-04 20:15:00.000	2022-12-04 23:11:00.000	TAM	54	3	54	3
20Test	2022-12-04 20:15:00.000	2022-12-04 23:11:00.000	BVO	106	6	106	6

I've tried something like below:

SELECT dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count, src.Total, src.Count
FROM dw
INNER JOIN (
    SELECT SUM(src.Amount) Total, count(*) count, src.NameID, src.Material
    FROM src
    WHERE src.NameID = dw.NameID
    AND src.SampleDateTime BETWEEN dw.StartTime AND dw.EndTime
    GROUP BY src.NameID, d.Material
) as d on dw.NameID = d.NameID

But I get the following error:

Msg 4104, Level 16, State 1, Line 80
The multi-part identifier "dw.Machine" could not be bound.

I've thought about using a cursor and temp tables to loop through and aggregate the data from the SRC table based on records from the DW table, but I think there's a more efficient (and easier) way to do this.

CodePudding user response：

I think you're definitely on the right path, you've just overcomplicated it imo.

Instead of the subquery, just leave it as a normal JOIN. Then with a few minor updates to grouping and some column name & join fixes:

SELECT dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count, 
       SUM(src.Amount) AS Total_SRC, count(*) AS Count_SRC
FROM   dw
       INNER JOIN src ON dw.NameID = src.NameID
                   AND dw.Material = src.Material 
                   AND src.SampleDateTime BETWEEN dw.StartTime AND dw.EndTime
GROUP BY dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count

Result are as below, and can be seen in this db<>fiddle

NameID  StartTime                EndTime                  Material  Total  Count  Total_SRC  Count_SRC
19Test  2022-12-04 23:15:00.000  2022-12-05 01:57:00.000  BVO       32     5      32         5
19Test  2022-12-04 23:15:00.000  2022-12-05 01:57:00.000  TAM       26     4      26         4
20Test  2022-12-04 20:15:00.000  2022-12-04 23:11:00.000  BVO       106    6      106        6
20Test  2022-12-04 20:15:00.000  2022-12-04 23:11:00.000  TAM       54     3      54         3

Note - if you need to run this frequently and/or want quick results, I'd look at the indexing on the src table. Indeed, I would probably either

Put the clustered index on Name_Id, Material, SampleDateTime (in order)
Put a non-clustered index on those fields, and INCLUDE Amount (though this basically creates a copy of the whole table)

CodePudding user response：

Just another way of thinking, since you want to check the imported table of aggregated data against the source system table, you could use the EXCEPT operator as the following:

The SQL EXCEPT clause/operator is used to combine two SELECT statements and returns rows from the first SELECT statement that are not returned by the second SELECT statement. This means EXCEPT returns only rows, which are not available in the second SELECT statement.

Select NameID,
       StartTime,
       EndTime,
       Material,
       Total,
       [Count]
From DW
Except
Select S.NameID, 
       D.StartTime,
       D.EndTime,
       S.Material,
       Sum(S.Amount) As Total,
       Count(*) As [Count]
From SRC S Join DW D
On S.NameID = D.NameID And S.Material  = D.Material
   And S.TimeStamp Between D.StartTime And D.EndTime
Group By S.NameID, S.Material, D.StartTime, D.EndTime

So no rows will be returned by this query when all of imported table rows are found in the aggregated data from the source table, which means that the imported table is aggregated correctly.

See a demo.