Home > Blockchain >  mongodb - upsert causing E11000 duplicate key error
mongodb - upsert causing E11000 duplicate key error

Time:10-24

The upsert statement below does not work and get:

MongoServerError: E11000 duplicate key error collection: db.emails index: _id_ dup key: { _id: "8hh58975fw" }

My goal is to edit an existing document ( or create if exists ) with the same _id The code in api is:

await db.collection('emails')
.updateOne(
  {
  _id: userId,
  type: "profileCompletion",
  time: new Date(),
  },
  {
    $inc: { "count": 1},
  },
  {upsert: true},
)

But I coped it from this which works:

await db.collection('analytics')
.updateOne(
  {
  _id: getDateMonYear(new Date(), "mmddyyyy"),
  type: "Geo",
  },
  {
    $inc: { "count": 1},
  },
  {upsert: true},
)

CodePudding user response:

The solution was to remove:

time: new Date(),

from the query:

await db.collection('emails')
.updateOne(
  {
  _id: userId,
  type: "profileCompletion",
  time: new Date(),
  },
  {
    $inc: { "count": 1},
  },
  {upsert: true},
)

That was indeed the difference between the working and non working queries in my question.

Summary: Upsert with same _id is possible no matter what the _id is as long as there is a unique index. In my case, _id was being derived and was the same plus it was the unique index.

The link posted by the user who commented above is valid but the comments about multi threads etc are not relevant for this question.

CodePudding user response:

As @Kal shared in the answer, the main issue in this particular question was associated with the usage of time: new Date(). This answer is intended to help explore why that is the case, help clarify how concurrency could still be a concern with the adjusted update in that answer, and what a safer update operation may look like.

Upsert Modifications

When a new document is inserted as the result of an upsert, there are a few ways the new document gets constructed. This is currently documented here. Relevant to this situation is the "Set" behavior (due to the change being defined as { $inc: { "count": 1}, }) which uses an update operator. The relevant paragraph:

If no document matches the query criteria and the <update> parameter is a document with update operator expressions, then the operation creates a base document from the equality clauses in the <query> parameter and applies the expressions from the <update> parameter.

So given a (new) userId of 123, an upsert executed on January 1st would query the collection looking for a document that matches the following:

  {
    _id: 123,
    time: ISODate("2022-01-01..."),
    type: 'profileCompletion'
  }

Finding none, the upsert behavior would trigger and the operation would insert a document similar to the following into the collection:

  {
    _id: 123,
    time: ISODate("2022-01-01..."),
    type: 'profileCompletion',
    count: 1
  }

Now we move on to the what happens on a subsequent execution of the operation.

Upsert Matching

Let's say that the same operation (with userId of 123 is executed a month later at the beginning of February. The collection would be searched for a document that matches the following:

  {
    _id: 123,
    time: ISODate("2022-02-01..."),
    type: 'profileCompletion'
  }

While a document with the given _id (and type) does exist, the overall document does not match due to the difference in time. Similar to before, this triggers the upsert behavior and the operation attempts to insert the following document:

  {
    _id: 123,
    time: ISODate("2022-02-01"),
    type: 'profileCompletion',
    count: 1
  }

This of course fails with the aforementioned duplicate key error due to the existing document that contains the _id value of 123. Stated another way, one of the problems is the fact that the timestamp is different every time this operation is executed.

Therefore the change to remove time: new Date() from the <query> portion of the operation does two things at the same time:

  1. It eliminates the time field from being present in the document that gets originally inserted.
  2. It subsequently removes that component of the <query> when matching resulting in the existing document being updated as expected.

Concurrency

I was (obviously) incorrect when I claimed in the comments that the problem that you were experiencing was the result of concurrency. More correctly I should have stated that it may be the result of concurrency. This remains true after the change to remove time from the operation.

The documentation about upsert includes the following note:

If multiple, identical upserts are issued at roughly the same time, it is possible for update() used with upsert: true to create duplicate documents.

It then links to another section that begins with:

When using the upsert: true option with the update() method, and not using a unique index on the query field(s), multiple instances of a update() operation with similar query field(s) could result in duplicate documents being inserted in certain circumstances.

I originally pointed to that documentation as "proof" that concurrency can be an issue. I believe @Kal originally correctly pointed out that the following text at the end of the section implies that the second operation should update as expected (emphasis added):

With this unique index in place, the multiple update() operations now exhibit the following behavior:

  • Exactly one update() operation will successfully insert a new document.
  • All other update() operations will update the newly-inserted document, incrementing the score value.

This was news to me! I went back in time and took a look at how that section was worded for version 4.0 (emphasis added again):

The remaining operations would either:

  • update the newly inserted document, or
  • fail when they attempted to insert a duplicate. If the operation fails because of a duplicate index key error, applications may retry the operation which will succeed as an update operation.

That text changed for version 4.2. After digging around a bit I was able to find this improvement request titled "Retry on predicate unique index violations of update upsert -> insert when possible". The Issue Description suggests that there were some improvements made to this behavior in some situations for 4.2, more on this in the following section.

Alternative (Safer) Rewrite

The aforementioned improvement shows that the server can now transparently retry an upsert that encounters a duplicate key exception under certain conditions. There is a table in that ticket that lays out what those conditions are, but they look to me like the ones that would be the result of concurrency between writes as opposed to other causes of the error including the time: new Date() problem in this particular question.

In fact, we can see that even the modified operation in the other answer will not transparently retry if it encountered a duplicate key exception. If I understand correctly, the modified operation would now have a <query> of { _id: userId, type: "profileCompletion" }. This falls under row 6 described in that table as type is not part of the unique index definition.

In any case, while the modified operation no longer generates the duplicate key exception without concurrency, you may still wish to rewrite the operation. You can absolutely still add a timestamp to the document when it is inserted, but it shouldn't be specified in the <query> portion of the operation. Rather, it should be expressed either via $set or $setOnInsert as part of the described change, depending on exactly what behavior you want when the operation performs an update

Something along these lines:

await db.collection('emails')
.updateOne(
  {
  _id: userId
  },
  {
    $setOnInsert: 
      {
        type: "profileCompletion",
        time: new Date()
      },
    $inc: { "count": 1},
  },
  {upsert: true},
)

It feels like the documentation is a bit misleading so I'll send some feedback on the site.

  • Related