I am trying to create a dimension table (NewTable
) from an existing Data Warehouse table (OldTable
) that doesn't have a primary key.
The OldTable
holds distinct values in [IdentifierCode]
and other values repeat around it. I also need to invoke 3 functions to add reporting context.
I want IdentifierCode_ID
to be an INT column - as the [IdentifierCode]
column is VARCHAR(6)
.
My question is this: is using ROW_NUMBER()
(as shown below) producing a suitably unique value?
My concern is that the row order on the live table could change if other rows are inserted to remediate missed codes.
Edit: OldTable
has 500k rows in all and 227k when filtered with the WHERE
clause
SELECT
ROW_NUMBER() OVER (ORDER BY LoadDate, StartDate, Product, IdentifierCode) AS IdentifierCode_ID,
LoadDate,
StartDate,
EndDate,
Product,
IdentifierCode,
OtherField1, OtherField2, OtherField3, OtherField4,
Function1, Function2, Function3
INTO
NewTable
FROM
OldTable
WHERE
GETDATE() BETWEEN StartDate AND EndDate
CodePudding user response:
First, unless you're either loading data once and never touching it again or are truncating NewTable before each load of a new date range, your approach will not work. ROW_NUMBER will restart at 1 and violate the primary key.
Even if you ARE truncating the table or only loading once ever, there is still a better way. Designate IdentifierCode_ID
as an Identity column and SQL will take care of it for you. If the type is INT and IDENTITY is set, SQL will automatically add 1 to the last value when inserting a new row, you don't even have to assign it!
CREATE TABLE dbo.NewTable(
[IdentifierCode_ID] int IDENTITY(1,1) NOT NULL,
[IdentifierCode] VARCHAR(6) NOT NULL,
...
Also, make sure you consider what you'll do if you accidentally select an overlapping date range for subsequent loads and if values in the OldTable change - for example, add a restriction to the WHERE clause to exclude existing IdentifierCode
values from the insert, and add a second query to update existing IdentifierCode
values that have a different LoadDate
, StartDate
, etc.
...
AND NOT EXISTS (SELECT * FROM NewTable as N WHERE N.IdentifierCode = OldTable.IdentifierCode)
For updating existing rows that changed, you can do an INNER JOIN to select only existing rows and a WHERE clause for only rows that changed.
UPDATE NewTable
SET LoadDate = O.LoadDate, StartDate = O.StartDate, ... --don't forget to recalculate the functions!
FROM NewTable as N INNER JOIN OldTable as O on N.IdentifierCode = O.IdentifierCode
WHERE GETDATE() between O.StartDate and O.EndDate
AND NOT (N.StartDate = O.StartDate and N.EndDate = O.EndDate ... )