Is my effort necessary and is my approach creating a suitable primary key?-CodePudding

I am trying to create a dimension table (NewTable) from an existing Data Warehouse table (OldTable) that doesn't have a primary key.

The OldTable holds distinct values in [IdentifierCode] and other values repeat around it. I also need to invoke 3 functions to add reporting context.

I want IdentifierCode_ID to be an INT column - as the [IdentifierCode] column is VARCHAR(6).

My question is this: is using ROW_NUMBER() (as shown below) producing a suitably unique value?

My concern is that the row order on the live table could change if other rows are inserted to remediate missed codes.

Edit: OldTable has 500k rows in all and 227k when filtered with the WHERE clause

SELECT
    ROW_NUMBER() OVER (ORDER BY LoadDate, StartDate, Product, IdentifierCode) AS IdentifierCode_ID,
    LoadDate, 
    StartDate,
    EndDate,
    Product,
    IdentifierCode,
    OtherField1, OtherField2, OtherField3, OtherField4,
    Function1, Function2, Function3
INTO 
    NewTable
FROM 
    OldTable
WHERE
    GETDATE() BETWEEN StartDate AND EndDate

CodePudding user response：

First, unless you're either loading data once and never touching it again or are truncating NewTable before each load of a new date range, your approach will not work. ROW_NUMBER will restart at 1 and violate the primary key.

Even if you ARE truncating the table or only loading once ever, there is still a better way. Designate IdentifierCode_ID as an Identity column and SQL will take care of it for you. If the type is INT and IDENTITY is set, SQL will automatically add 1 to the last value when inserting a new row, you don't even have to assign it!

CREATE TABLE dbo.NewTable(
    [IdentifierCode_ID] int IDENTITY(1,1) NOT NULL,
    [IdentifierCode] VARCHAR(6) NOT NULL,
...

Also, make sure you consider what you'll do if you accidentally select an overlapping date range for subsequent loads and if values in the OldTable change - for example, add a restriction to the WHERE clause to exclude existing IdentifierCode values from the insert, and add a second query to update existing IdentifierCode values that have a different LoadDate, StartDate, etc.

...
AND NOT EXISTS (SELECT * FROM NewTable as N WHERE N.IdentifierCode = OldTable.IdentifierCode)

For updating existing rows that changed, you can do an INNER JOIN to select only existing rows and a WHERE clause for only rows that changed.

UPDATE NewTable
SET LoadDate = O.LoadDate, StartDate = O.StartDate, ... --don't forget to recalculate the functions!
FROM NewTable as N INNER JOIN OldTable as O on N.IdentifierCode = O.IdentifierCode
WHERE GETDATE() between O.StartDate and O.EndDate
  AND NOT (N.StartDate = O.StartDate and N.EndDate = O.EndDate ... )