How can a constraint be violated when the SQL query excludes it?-CodePudding

I have a result table that holds the output of a large, complicated, slow running query.

It's defined something like:

create table ResultsStore (
    Foo int not null,
    Bar int not null,
    ... other fields
    constraint [PK_ResultsStore_foo_bar] primary key clustered
    (
        Foo asc,
        Bar asc
    )
)

I then insert to this table with a query like:

insert ResultsStore (Foo, Bar) 
output inserted.* 
select subQuery.ID, @bar
from ( 
    -- large complex slow query
    ) subQuery
where subQuery.ID not in ( 
        select Foo 
        from ResultsStore 
        where Bar = @bar 
    )

In testing this is fine, but in production, with lots of users hitting it regularly, we often get an exception:

Violation of PRIMARY KEY constraint 'PK_ResultsStore_foo_bar'. Cannot insert duplicate key in object 'ResultsStore'.

How is this possible? Surely the where should exclude any combination of the multiple primary key fields where they are already in the table?

How to best avoid this?

CodePudding user response：

As written two sessions can run the query, both checking for the existence of the row concurrently, both not finding it, then both proceeding to attempt the insert. The first one will succeed in READ COMMITED, and the second one will fail.

You need WITH (UPDLOCK, HOLDLOCK, ROWLOCK) on the subquery to avoid this race condition. At default read committed isolation level either S locks taken by the sub query or row versioning is used and no locks at all are taken.

The HOLDLOCK gives serializable semantics and protects the range. UPDLOCK forces the read to use a U lock which will block other sessions from reading with UPDLOCK.

CodePudding user response：

You can also use a temp table to hold interim results and perform the final insert at the end.

The following also includes a DISTINCT (which might or might not be needed), changes the dup test to use EXISTS, and applies WITH (UPDLOCK, HOLDLOCK, ROWLOCK) options to the final insert as suggested by others.

declare @TempResults table (
    Foo int not null,
    Bar int not null
)

insert @TempResults
select distinct subQuery.ID, @bar
from ( 
    -- large complex slow query
    ) subQuery

insert ResultsStore (Foo, Bar) 
output inserted.* 
select T.Foo, T.Bar
from @TempResults T
where not exists ( 
   select * 
   from ResultsStore RS
   where RS.Foo = T.Foo
   and RS.Bar = T.Bar
)
WITH (UPDLOCK, HOLDLOCK, ROWLOCK)

This lets your long running query run fast and dirty (as you intend), but should maintain integrity and minimize actual lock duration for the final insert.