I'm trying to write a query to check an import table of aggregated data against the source system (transactional data). I feel like it should be simple, but I can't seem to get it figured.
The aggregate table (let's call it DW) looks like this:
NameID | StartTime | EndTime | Material | Total | Count |
---|---|---|---|---|---|
19Test | 2022-12-04 23:15:00.000 | 2022-12-05 01:57:00.000 | TAM | 26 | 4 |
19Test | 2022-12-04 23:15:00.000 | 2022-12-05 01:57:00.000 | BVO | 32 | 5 |
20Test | 2022-12-04 20:15:00.000 | 2022-12-04 23:11:00.000 | TAM | 54 | 3 |
20Test | 2022-12-04 20:15:00.000 | 2022-12-04 23:11:00.000 | BVO | 106 | 6 |
Total is the sum of Amount from the SRC table (below) and Count is the count of records that match the NameID and Material and occur between StartTime and EndTime.
I want to check against the source table to make sure the pipeline that picks up the information aggregated everything correctly.
Source Table (let's call it SRC) looks like this:
NameID | TimeStamp | Material | Amount |
---|---|---|---|
19Test | 2022-12-04 22:15:56.000 | TAM | 3 |
19Test | 2022-12-04 23:17:02.000 | TAM | 5 |
19Test | 2022-12-04 23:18:06.000 | TAM | 7 |
19Test | 2022-12-04 23:19:14.000 | TAM | 11 |
19Test | 2022-12-04 23:20:19.000 | BVO | 1 |
19Test | 2022-12-04 23:21:19.000 | BVO | 6 |
19Test | 2022-12-04 23:22:33.000 | BVO | 9 |
19Test | 2022-12-04 23:23:36.000 | CAD | 2 |
19Test | 2022-12-04 23:24:53.000 | BVO | 6 |
19Test | 2022-12-04 23:25:56.000 | TAM | 3 |
19Test | 2022-12-04 23:27:23.000 | CAD | 8 |
19Test | 2022-12-04 23:28:25.000 | BVO | 10 |
20Test | 2022-12-04 18:15:56.000 | TAM | 11 |
20Test | 2022-12-04 20:17:02.000 | TAM | 20 |
20Test | 2022-12-04 20:18:06.000 | TAM | 33 |
20Test | 2022-12-04 20:19:14.000 | TAM | 1 |
20Test | 2022-12-04 20:20:19.000 | BVO | 7 |
20Test | 2022-12-04 20:21:19.000 | BVO | 22 |
20Test | 2022-12-04 20:22:33.000 | BVO | 31 |
20Test | 2022-12-04 20:23:36.000 | CAD | 7 |
20Test | 2022-12-04 20:24:53.000 | BVO | 11 |
20Test | 2022-12-04 20:25:56.000 | BVO | 21 |
20Test | 2022-12-04 20:27:23.000 | CAD | 29 |
20Test | 2022-12-04 20:28:25.000 | BVO | 14 |
*I included 2 timestamped records that shouldn't match and also 4 material records that don't match
Ideally, I'm looking for a result set that would be an extension of the DW table by having columns represent the totals from the SRC table:
NameID | StartTime | EndTime | Material | Total | Count | Total - SRC | Count - SRC |
---|---|---|---|---|---|---|---|
19Test | 2022-12-04 23:15:00.000 | 2022-12-05 01:57:00.000 | TAM | 26 | 4 | 26 | 4 |
19Test | 2022-12-04 23:15:00.000 | 2022-12-05 01:57:00.000 | BVO | 32 | 5 | 32 | 5 |
20Test | 2022-12-04 20:15:00.000 | 2022-12-04 23:11:00.000 | TAM | 54 | 3 | 54 | 3 |
20Test | 2022-12-04 20:15:00.000 | 2022-12-04 23:11:00.000 | BVO | 106 | 6 | 106 | 6 |
I've tried something like below:
SELECT dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count, src.Total, src.Count
FROM dw
INNER JOIN (
SELECT SUM(src.Amount) Total, count(*) count, src.NameID, src.Material
FROM src
WHERE src.NameID = dw.NameID
AND src.SampleDateTime BETWEEN dw.StartTime AND dw.EndTime
GROUP BY src.NameID, d.Material
) as d on dw.NameID = d.NameID
But I get the following error:
Msg 4104, Level 16, State 1, Line 80
The multi-part identifier "dw.Machine" could not be bound.
I've thought about using a cursor and temp tables to loop through and aggregate the data from the SRC table based on records from the DW table, but I think there's a more efficient (and easier) way to do this.
CodePudding user response:
I think you're definitely on the right path, you've just overcomplicated it imo.
Instead of the subquery, just leave it as a normal JOIN. Then with a few minor updates to grouping and some column name & join fixes:
SELECT dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count,
SUM(src.Amount) AS Total_SRC, count(*) AS Count_SRC
FROM dw
INNER JOIN src ON dw.NameID = src.NameID
AND dw.Material = src.Material
AND src.SampleDateTime BETWEEN dw.StartTime AND dw.EndTime
GROUP BY dw.NameID, dw.StartTime, dw.EndTime, dw.Material, dw.Total, dw.Count
Result are as below, and can be seen in this db<>fiddle
NameID StartTime EndTime Material Total Count Total_SRC Count_SRC
19Test 2022-12-04 23:15:00.000 2022-12-05 01:57:00.000 BVO 32 5 32 5
19Test 2022-12-04 23:15:00.000 2022-12-05 01:57:00.000 TAM 26 4 26 4
20Test 2022-12-04 20:15:00.000 2022-12-04 23:11:00.000 BVO 106 6 106 6
20Test 2022-12-04 20:15:00.000 2022-12-04 23:11:00.000 TAM 54 3 54 3
Note - if you need to run this frequently and/or want quick results, I'd look at the indexing on the src table. Indeed, I would probably either
- Put the clustered index on Name_Id, Material, SampleDateTime (in order)
- Put a non-clustered index on those fields, and INCLUDE Amount (though this basically creates a copy of the whole table)
CodePudding user response:
Just another way of thinking, since you want to check the imported table of aggregated data against the source system table, you could use the EXCEPT
operator as the following:
The SQL EXCEPT clause/operator is used to combine two SELECT statements and returns rows from the first SELECT statement that are not returned by the second SELECT statement. This means EXCEPT returns only rows, which are not available in the second SELECT statement.
Select NameID,
StartTime,
EndTime,
Material,
Total,
[Count]
From DW
Except
Select S.NameID,
D.StartTime,
D.EndTime,
S.Material,
Sum(S.Amount) As Total,
Count(*) As [Count]
From SRC S Join DW D
On S.NameID = D.NameID And S.Material = D.Material
And S.TimeStamp Between D.StartTime And D.EndTime
Group By S.NameID, S.Material, D.StartTime, D.EndTime
So no rows will be returned by this query when all of imported table rows are found in the aggregated data from the source table, which means that the imported table is aggregated correctly.
See a demo.