MongoDB - duplicate documents removal

Time:07-04

Context: I have a MongoDB database with some duplicated documents.

Problem: I want to remove all duplicated documents. (For each set of duplicates, I want to keep exactly one document, which can be chosen arbitrarily.)

Minimal illustrative example:

The documents all have the following fields (there are also other fields, but those are of no relevance here):

{
    "_id": {"$oid":"..."},
    "name": "string",
    "user": {"$oid":"..."}
}

Duplicated documents: A document is considered duplicated if there are two or more documents with the same "name" and "user" (i.e. the document id is of no relevance here).
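To make the rule concrete, here is a minimal sketch in plain JavaScript with no database involved. The helper name findDuplicateIds and the sample documents are made up for illustration, and ids are plain strings rather than real ObjectIds.

```javascript
// Hypothetical helper: returns the _id of every document that repeats an
// earlier (name, user) pair. The first document seen for each pair is kept.
function findDuplicateIds(docs) {
  const seen = new Set();
  const duplicateIds = [];
  for (const doc of docs) {
    // Stringifying a two-element array gives an unambiguous composite key.
    const key = JSON.stringify([doc.name, String(doc.user)]);
    if (seen.has(key)) {
      duplicateIds.push(doc._id);
    } else {
      seen.add(key);
    }
  }
  return duplicateIds;
}

// Made-up sample data; ids are plain strings here instead of ObjectIds.
const sampleDocs = [
  { _id: "a1", name: "report", user: "u1" },
  { _id: "a2", name: "report", user: "u1" }, // duplicate of a1
  { _id: "a3", name: "report", user: "u2" }, // same name, different user
  { _id: "a4", name: "notes", user: "u1" },
  { _id: "a5", name: "report", user: "u1" }, // duplicate of a1
];

console.log(findDuplicateIds(sampleDocs)); // [ 'a2', 'a5' ]
```

Here a1, a3 and a4 would be kept, while a2 and a5 would be removed.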

How can I remove the duplicated documents?

CodePudding user response:

EDIT: Since MongoDB version 4.2, one option is to use $group and $merge to copy one document per (name, user) pair into a new collection:

db.collection.aggregate([
  {
    $group: {
      _id: {name: "$name", user: "$user"},
      doc: {$first: "$$ROOT"}
    }
  },
  {$replaceRoot: {newRoot: "$doc"}},
  {$merge: {into: "newCollection"}}
])

See how it works on the playground example

For older versions, you can do the same using $out.
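As a rough sketch of the $out variant: the pipeline is the same, with $out as the final stage. The helper name buildDedupeOutPipeline and the target name "newCollection" are placeholders. Note that $out replaces the target collection entirely if it already exists, and that $replaceRoot requires MongoDB 3.4 or later.

```javascript
// Hypothetical helper: builds the de-duplication pipeline ending in $out
// instead of $merge. $out must be the last stage of the pipeline.
function buildDedupeOutPipeline(targetCollection) {
  return [
    // One group per (name, user) pair; keep the first document seen.
    { $group: { _id: { name: "$name", user: "$user" }, doc: { $first: "$$ROOT" } } },
    // Restore the original document shape.
    { $replaceRoot: { newRoot: "$doc" } },
    // Write the result to the target collection (overwrites existing content).
    { $out: targetCollection },
  ];
}

const pipeline = buildDedupeOutPipeline("newCollection");
// In the mongo shell this would be run as:
//   db.collection.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```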

Another option is to get a list of all documents to remove and remove them with another query:

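// Collect the _id of every duplicate, keeping the first document of each group.
var result = db.collection.aggregate([
  {
    $group: {
      _id: {name: "$name", user: "$user"},
      doc: {$first: "$$ROOT"},
      remove: {$push: "$_id"}
    }
  },
  // Drop the _id of the kept document from each group's list.
  {
    $set: {
      remove: {
        $filter: {
          input: "$remove",
          cond: {$ne: ["$$this", "$doc._id"]}
        }
      }
    }
  },
  // Gather all per-group lists into a single document...
  {$group: {_id: 0, remove: {$push: "$remove"}}},
  // ...and flatten the array of arrays into one array of ids.
  {
    $set: {
      _id: "$$REMOVE",
      remove: {
        $reduce: {
          input: "$remove",
          initialValue: [],
          in: {$concatArrays: ["$$value", "$$this"]}
        }
      }
    }
  }
]).toArray()

// The pipeline returns at most one document; an empty collection yields none.
var removeList = result.length > 0 ? result[0].remove : []

db.collection.deleteMany({_id: {$in: removeList}})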
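To check the result afterwards, one possible sketch is a pipeline that lists every (name, user) pair that still occurs more than once. The variable name duplicateCheckPipeline is made up here; the field names follow the question.

```javascript
// Hypothetical check: finds (name, user) pairs that still have duplicates.
const duplicateCheckPipeline = [
  // Count how many documents share each (name, user) pair.
  { $group: { _id: { name: "$name", user: "$user" }, count: { $sum: 1 } } },
  // Keep only the pairs that occur more than once.
  { $match: { count: { $gt: 1 } } },
];

// In the mongo shell this would be run as:
//   db.collection.aggregate(duplicateCheckPipeline).toArray()
console.log(JSON.stringify(duplicateCheckPipeline, null, 2));
```

The returned array should be empty once no duplicates remain.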