Home > Software design >  Compare duplicates by Score By Not Max Field or Earlier Date
Compare duplicates by Score By Not Max Field or Earlier Date

Time:08-18

I need to compare duplicated documents and get the duplicated ones with the Lowest Score. If the Score between two duplicates is Equal, then get the one with earlier date.

{
 "_id": UUID("c77c72de-edd8-4576-a72c-983cf93a0f31"),
 "DocumentId": "05240423067",
 "Name": "John Doe",
 "CreationDate": ISODate("0001-01-01T00:00:00.000 00:00"),
 "Score": 5,
},
{
 "_id": UUID("b5a7d404-a341-45dd-b875-864cd1e6bda2"),
 "DocumentId": "05240423067",
 "Name": "John Doe",
 "CreationDate": ISODate("2021-07-17T00:00:00.000 00:00"),
 "Score": 2
},
{
 "_id": UUID("9efddd23-4b6b-4e96-ab43-b24a080107db"),
 "DocumentId": "05240423067",
 "Name": "John Doe",
 "CreationDate": ISODate("2021-07-10T00:00:00.000 00:00"),
 "Score": 2
},
{
 "_id": UUID("f1a063a5-f9dd-4998-b6aa-df2071dd8677"),
 "DocumentId": "88313825863",
 "Name": "Marcus Joseph",
 "CreationDate": ISODate("2021-07-17T00:00:00.000 00:00"),
 "Score": 2
},
{
 "_id": UUID("e3262f8e-bd6a-49e8-abe5-c3c1a4e49900"),
 "DocumentId": "88313825863",
 "Name": "Marcus Joseph",
 "CreationDate": ISODate("0001-01-01T00:00:00.000 00:00"),
 "Score": 1
}

Later, the resulting documents will be deleted.

Expected Result:

{
 "_id": UUID("b5a7d404-a341-45dd-b875-864cd1e6bda2"),
 "DocumentId": "05240423067",
 "Name": "John Doe",
 "CreationDate": ISODate("2021-07-17T00:00:00.000 00:00"),
 "Score": 2 // Return Documents with the **Lowest Score**
},
{
 "_id": UUID("9efddd23-4b6b-4e96-ab43-b24a080107db"),
 "DocumentId": "05240423067",
 "Name": "John Doe",
 "CreationDate": ISODate("2021-07-10T00:00:00.000 00:00"),
 "Score": 2 // Return Documents with the **Lowest Score**
},
{
 "_id": UUID("e3262f8e-bd6a-49e8-abe5-c3c1a4e49900"),
 "DocumentId": "88313825863",
 "Name": "Marcus Joseph",
 "CreationDate": ISODate("0001-01-01T00:00:00.000 00:00"),
 "Score": 2 // If both Scores Equal, Compare CreationDate earlier
}

Mongo Version 4.2.21

CodePudding user response:

This would be easier with some of the newer "$group" accumulators introduced in more recent versions of MongoDB, but here's one way you could do it.

db.collection.aggregate([
  {
    "$group": {
      "_id": "$DocumentId",
      "count": {"$sum": 1},
      "docs": {"$push": "$$ROOT"}
    }
  },
  { // if only 1, keep it
    "$match": {
      "$expr": {"$gt": ["$count", 1]}
    }
  },
  { // find the doc to keep
    "$set": {
      "keepDoc": {
        "$reduce": {
          "input": "$docs",
          "initialValue": {
            "Score": {"$minKey": 1}
          },
          "in": {
            "$switch": {
              "branches": [
                {
                  "case": {"$gt": ["$$this.Score", "$$value.Score"]},
                  "then": "$$this"
                },
                {
                  "case": {"$eq": ["$$this.Score", "$$value.Score"]},
                  "then": {
                    "$cond": [
                      {"$gt": ["$$this.CreationDate", "$$value.CreationDate"]},
                      "$$this",
                      "$$value"
                    ]
                  }
                }
              ],
              "default": "$$value"
            }
          }
        }
      }
    }
  },
  { // get docs other than keepDoc
    "$project": {
      "_id": 0,
      "expiredDocs": {
        "$filter": {
          "input": "$docs",
          "cond": {"$ne": ["$$this", "$keepDoc"]}
        }
      }
    }
  },
  {"$unwind": "$expiredDocs"},
  {"$replaceWith": "$expiredDocs"}
])

Try it on mongoplayground.net.

N.B.: On mongoplayground.net, there's no easy way that I know of to enter binary UUID values in the BSON configuration, so I just used strings. It should be inconsequential to the pipeline.

  • Related