Home > Software design >  MongoDB- How to remove duplicate documents?
MongoDB- How to remove duplicate documents?

Time:10-08

I am trying to create a script to remove duplicates from a collection using the MongoDB aggregation framework.

Input: some of the documents in the collection

{"_id":{"$oid":"6122ca5df5ee9d5974c6e886"},"Name":"ccc","cusId":1}
{"_id":{"$oid":"6122ca5df5ee9d5974c6e887"},"Name":"ddd","cusId":4}
{"_id":{"$oid":"6122ca5df5ee9d5974c6e888"},"Name":"eee","cusId":5}
{"_id":{"$oid":"612bc88610921de4b5a6bd0d"},"Name":"last","cusId":1}
{"_id":{"$oid":"61571141c8cdee23c9a46a3a"},"Name":"lastRecord","cusId":1}

Script to find duplicates except the last inserted document

$group
{ 
     _id:  "$cusId", 
             
            mongoIds: { $addToSet: "$_id" },
            count: { $sum: 1 } 
 },$match
{
  count: { $gt: 1 }
},$project
{
   
   "duplicates":{
      "$filter":{
         "input":"$mongoIds",
         "as":"mongoId",
         "cond":{
            "$lt":[
               "$$mongoId",
               {
                  "$max":"$mongoIds"
               }
            ]
         }
      }
   }
}

the expected results

the expected output of the script is the array duplicates to contain only documents that are duplicated except the last inserted document to the array.

for instance, for documents with cusId 4 and 5, there are no duplicates for this cusId so they should not be included in the duplicates array. but for documents with cusId 1, there should be 2 duplicates (all documents with cusId 1 except the last inserted document)

so the final result should be all duplicates except the last inserted document

{"_id":{"$oid":"6122ca5df5ee9d5974c6e886"},"Name":"ccc","cusId":1}
{"_id":{"$oid":"612bc88610921de4b5a6bd0d"},"Name":"last","cusId":1}

how to remove duplicates where cusId is the same and keep only the latest inserted document in the collection? How can I get document creation time?

CodePudding user response:

You can do the followings in an aggregation pipeline:

  1. $lookup self collection with a subpipeline
  2. In the subpipeline, do a $group to get the latest timestamp through $toDate : "$_id"
  3. filter the root document by $ne to the timestamp of the $lookup result (i.e. not the latest document)
  4. $out the result to another collection. You can then use this output collection to prune your original collection through a simple $merge

Here is the Mongo playground for your reference.

  • Related