Home > Back-end >  Is my effort necessary and is my approach creating a suitable primary key?
Is my effort necessary and is my approach creating a suitable primary key?

Time:09-01

I am trying to create a dimension table (NewTable) from an existing Data Warehouse table (OldTable) that doesn't have a primary key.

The OldTable holds distinct values in [IdentifierCode] and other values repeat around it. I also need to invoke 3 functions to add reporting context.

I want IdentifierCode_ID to be an INT column - as the [IdentifierCode] column is VARCHAR(6).

My question is this: is using ROW_NUMBER() (as shown below) producing a suitably unique value?

My concern is that the row order on the live table could change if other rows are inserted to remediate missed codes.

Edit: OldTable has 500k rows in all and 227k when filtered with the WHERE clause

SELECT
    ROW_NUMBER() OVER (ORDER BY LoadDate, StartDate, Product, IdentifierCode) AS IdentifierCode_ID,
    LoadDate, 
    StartDate,
    EndDate,
    Product,
    IdentifierCode,
    OtherField1, OtherField2, OtherField3, OtherField4,
    Function1, Function2, Function3
INTO 
    NewTable
FROM 
    OldTable
WHERE
    GETDATE() BETWEEN StartDate AND EndDate

CodePudding user response:

First, unless you're either loading data once and never touching it again or are truncating NewTable before each load of a new date range, your approach will not work. ROW_NUMBER will restart at 1 and violate the primary key.

Even if you ARE truncating the table or only loading once ever, there is still a better way. Designate IdentifierCode_ID as an Identity column and SQL will take care of it for you. If the type is INT and IDENTITY is set, SQL will automatically add 1 to the last value when inserting a new row, you don't even have to assign it!

CREATE TABLE dbo.NewTable(
    [IdentifierCode_ID] int IDENTITY(1,1) NOT NULL,
    [IdentifierCode] VARCHAR(6) NOT NULL,
...

Also, make sure you consider what you'll do if you accidentally select an overlapping date range for subsequent loads and if values in the OldTable change - for example, add a restriction to the WHERE clause to exclude existing IdentifierCode values from the insert, and add a second query to update existing IdentifierCode values that have a different LoadDate, StartDate, etc.

...
AND NOT EXISTS (SELECT * FROM NewTable as N WHERE N.IdentifierCode = OldTable.IdentifierCode)

For updating existing rows that changed, you can do an INNER JOIN to select only existing rows and a WHERE clause for only rows that changed.

UPDATE NewTable
SET LoadDate = O.LoadDate, StartDate = O.StartDate, ... --don't forget to recalculate the functions!
FROM NewTable as N INNER JOIN OldTable as O on N.IdentifierCode = O.IdentifierCode
WHERE GETDATE() between O.StartDate and O.EndDate
  AND NOT (N.StartDate = O.StartDate and N.EndDate = O.EndDate ... )
  • Related