The upsert statement below does not work and get:
MongoServerError: E11000 duplicate key error collection: db.emails index: _id_ dup key: { _id: "8hh58975fw" }
My goal is to edit an existing document ( or create if exists ) with the same _id
The code in api is:
await db.collection('emails')
.updateOne(
{
_id: userId,
type: "profileCompletion",
time: new Date(),
},
{
$inc: { "count": 1},
},
{upsert: true},
)
But I coped it from this which works:
await db.collection('analytics')
.updateOne(
{
_id: getDateMonYear(new Date(), "mmddyyyy"),
type: "Geo",
},
{
$inc: { "count": 1},
},
{upsert: true},
)
CodePudding user response:
The solution was to remove:
time: new Date(),
from the query:
await db.collection('emails')
.updateOne(
{
_id: userId,
type: "profileCompletion",
time: new Date(),
},
{
$inc: { "count": 1},
},
{upsert: true},
)
That was indeed the difference between the working and non working queries in my question.
Summary:
Upsert with same _id is possible no matter what the _id is as long as there is a unique index. In my case, _id
was being derived and was the same plus it was the unique index.
The link posted by the user who commented above is valid but the comments about multi threads etc are not relevant for this question.
CodePudding user response:
As @Kal shared in the answer, the main issue in this particular question was associated with the usage of time: new Date()
. This answer is intended to help explore why that is the case, help clarify how concurrency could still be a concern with the adjusted update
in that answer, and what a safer update
operation may look like.
Upsert Modifications
When a new document is inserted as the result of an upsert
, there are a few ways the new document gets constructed. This is currently documented here. Relevant to this situation is the "Set" behavior (due to the change being defined as { $inc: { "count": 1}, }
) which uses an update operator. The relevant paragraph:
If no document matches the query criteria and the
<update>
parameter is a document with update operator expressions, then the operation creates a base document from the equality clauses in the<query>
parameter and applies the expressions from the<update>
parameter.
So given a (new) userId
of 123
, an upsert
executed on January 1st would query the collection looking for a document that matches the following:
{
_id: 123,
time: ISODate("2022-01-01..."),
type: 'profileCompletion'
}
Finding none, the upsert
behavior would trigger and the operation would insert a document similar to the following into the collection:
{
_id: 123,
time: ISODate("2022-01-01..."),
type: 'profileCompletion',
count: 1
}
Now we move on to the what happens on a subsequent execution of the operation.
Upsert Matching
Let's say that the same operation (with userId
of 123
is executed a month later at the beginning of February. The collection would be searched for a document that matches the following:
{
_id: 123,
time: ISODate("2022-02-01..."),
type: 'profileCompletion'
}
While a document with the given _id
(and type
) does exist, the overall document does not match due to the difference in time
. Similar to before, this triggers the upsert
behavior and the operation attempts to insert the following document:
{
_id: 123,
time: ISODate("2022-02-01"),
type: 'profileCompletion',
count: 1
}
This of course fails with the aforementioned duplicate key error due to the existing document that contains the _id
value of 123
. Stated another way, one of the problems is the fact that the timestamp is different every time this operation is executed.
Therefore the change to remove time: new Date()
from the <query>
portion of the operation does two things at the same time:
- It eliminates the
time
field from being present in the document that gets originally inserted. - It subsequently removes that component of the
<query>
when matching resulting in the existing document being updated as expected.
Concurrency
I was (obviously) incorrect when I claimed in the comments that the problem that you were experiencing was the result of concurrency. More correctly I should have stated that it may be the result of concurrency. This remains true after the change to remove time
from the operation.
The documentation about upsert includes the following note:
If multiple, identical upserts are issued at roughly the same time, it is possible for
update()
used withupsert: true
to create duplicate documents.
It then links to another section that begins with:
When using the
upsert: true
option with theupdate()
method, and not using a unique index on the query field(s), multiple instances of aupdate()
operation with similar query field(s) could result in duplicate documents being inserted in certain circumstances.
I originally pointed to that documentation as "proof" that concurrency can be an issue. I believe @Kal originally correctly pointed out that the following text at the end of the section implies that the second operation should update
as expected (emphasis added):
With this unique index in place, the multiple
update()
operations now exhibit the following behavior:
- Exactly one
update()
operation will successfully insert a new document.- All other
update()
operations will update the newly-inserted document, incrementing the score value.
This was news to me! I went back in time and took a look at how that section was worded for version 4.0 (emphasis added again):
The remaining operations would either:
- update the newly inserted document, or
- fail when they attempted to insert a duplicate. If the operation fails because of a duplicate index key error, applications may retry the operation which will succeed as an update operation.
That text changed for version 4.2
. After digging around a bit I was able to find this improvement request titled "Retry on predicate unique index violations of update upsert -> insert when possible". The Issue Description suggests that there were some improvements made to this behavior in some situations for 4.2
, more on this in the following section.
Alternative (Safer) Rewrite
The aforementioned improvement shows that the server can now transparently retry an upsert
that encounters a duplicate key exception under certain conditions. There is a table in that ticket that lays out what those conditions are, but they look to me like the ones that would be the result of concurrency between writes as opposed to other causes of the error including the time: new Date()
problem in this particular question.
In fact, we can see that even the modified operation in the other answer will not transparently retry if it encountered a duplicate key exception. If I understand correctly, the modified operation would now have a <query
> of { _id: userId, type: "profileCompletion" }
. This falls under row 6 described in that table as type
is not part of the unique index definition.
In any case, while the modified operation no longer generates the duplicate key exception without concurrency, you may still wish to rewrite the operation. You can absolutely still add a timestamp to the document when it is inserted, but it shouldn't be specified in the <query>
portion of the operation. Rather, it should be expressed either via $set
or $setOnInsert
as part of the described change, depending on exactly what behavior you want when the operation performs an update
Something along these lines:
await db.collection('emails')
.updateOne(
{
_id: userId
},
{
$setOnInsert:
{
type: "profileCompletion",
time: new Date()
},
$inc: { "count": 1},
},
{upsert: true},
)
It feels like the documentation is a bit misleading so I'll send some feedback on the site.