Home > Software design >  Why does identical code for SQL Merge (Upsert) work in Microsoft SQL Server console but doesn't
Why does identical code for SQL Merge (Upsert) work in Microsoft SQL Server console but doesn't

Time:07-02

I have a function in my main Python file which gets called by main() and executes a SQL Merge (Upsert) statement using pyodbc from a different file & function. Concretely, the SQL statement traverses a source table with transaction details by distinct transaction datetimes and merges customers into a separate target table. The function that executes the statement and the function that returns the completed SQL statement are attached below.

When I run my Python script, it doesn't work as expected and inserts only around 70 rows (sometimes 69, 71, or 72) into the target customer table. However, when I use an identical SQL statement and execute it in the Microsoft SQL Server Management Studio console (attached below), it works fine and inserts 4302 rows (as expected).

I'm not sure what's wrong.. Would really appreciate any help!

SQL Statement Executor in Python main file:

    def stage_to_dim(connection, cursor, now):
        log(f"Filling {cfg.dim_customer} and {cfg.dim_product}")
        try:
            cursor.execute(sql_statements.stage_to_dim_statement(now))
            connection.commit()
        except Exception as e:
            log(f"Error in stage_to_dim: {e}" )
            sys.exit(1) 
        log("Stage2Dimensions complete.")

SQL Statement formulator in Python:

    def stage_to_dim_statement(now):
        return f"""
        DECLARE @dates table(id INT IDENTITY(1,1), date DATETIME)
        INSERT INTO @dates (date)
            SELECT DISTINCT TransactionDateTime FROM {cfg.stage_table} ORDER BY TransactionDateTime;

        DECLARE @i INT;
        DECLARE @cnt INT;
        DECLARE @date DATETIME;

        SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;
        WHILE @i < @cnt
        BEGIN
            SET @i = @i   1
            SET @date = (SELECT date FROM @dates WHERE id = @i)
            MERGE {cfg.dim_customer} AS Target
            USING (SELECT * FROM {cfg.stage_table} WHERE TransactionDateTime = @date) AS Source 
                ON Target.CustomerCodeNK = Source.CustomerID
            WHEN MATCHED THEN
                UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
            Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME, '{now}'), Target.LoadSource = '{cfg.ingest_file_path}'
            WHEN NOT MATCHED THEN
                INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID, 
            Source.AcquisitionDate, Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'{now}'), '{cfg.ingest_file_path}');
        END
    """

SQL Statement from MS SQL Server Console:

    DECLARE @dates table(id INT IDENTITY(1,1), date DATETIME)
    INSERT INTO @dates (date)
        SELECT DISTINCT TransactionDateTime FROM dbo.STG_CustomerTransactions ORDER BY TransactionDateTime;
    DECLARE @i INT;
    DECLARE @cnt INT;
    DECLARE @date DATETIME;
    SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;
    WHILE @i < @cnt
    BEGIN
        SET @i = @i   1
        SET @date = (SELECT date FROM @dates WHERE id = @i)
        MERGE dbo.DIM_CustomerDup AS Target
        USING (SELECT * FROM dbo.STG_CustomerTransactions WHERE TransactionDateTime = @date) AS Source 
            ON Target.CustomerCodeNK = Source.CustomerID

        WHEN MATCHED THEN
           UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
           Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME,'6/30/2022 11:53:05'), Target.LoadSource = '../csv/cleaned_original_data.csv'

        WHEN NOT MATCHED THEN
            INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID, Source.AcquisitionDate, 
            Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'6/30/2022 11:53:05'), '../csv/cleaned_original_data.csv');
    END

CodePudding user response:

If you think carefully about what your final result ends up, you are actually just taking the latest row (by date) for each customer. So you can just filter the source using a standard row-number approach.

Exactly why the Python code didn't work properly is unclear, but the below query might work better. You are also doing SQL injection, which is dangerous and can also cause correctness problems.

Also you should always use a non-ambiguous date format.

MERGE dbo.DIM_CustomerDup AS t
USING (
    SELECT *
    FROM (
        SELECT *,
          rn = ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TransactionDateTime DESC)
        FROM dbo.STG_CustomerTransactions s
    ) AS s
    WHERE s.rn = 1
) AS s
ON t.CustomerCodeNK = s.CustomerID

WHEN MATCHED THEN
    UPDATE SET
      AquiredDate = s.AcquisitionDate,
      AquiredSource = s.AcquisitionSource,
      ZipCode = s.Zipcode,
      LoadDate = SYSDATETIME(),
      LoadSource = '../csv/cleaned_original_data.csv'

WHEN NOT MATCHED THEN
    INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
    VALUES (s.CustomerID, s.AcquisitionDate, s.AcquisitionSource, s.Zipcode, SYSDATETIME(), '../csv/cleaned_original_data.csv')
;
  • Related