get document with same 3 fields in a collection-CodePudding

i have a collection with more then 1000 documents and there are some documents with same value in some fields, i need to get those

the collection is:

[{_id,fields1,fields2,fields3,etc...}]

what query can i use to get all the elements that have the same 3 fields for example:

[
 {_id:1,fields1:'a',fields2:1,fields3:'z'},
 {_id:2,fields1:'a',fields2:1,fields3:'z'},
 {_id:3,fields1:'f',fields2:2,fields3:'g'},
 {_id:4,fields1:'f',fields2:2,fields3:'g'},
 {_id:5,fields1:'j',fields2:3,fields3:'g'},
]

i need to get

[
 {_id:2,fields1:'a',fields2:1,fields3:'z'},
 {_id:4,fields1:'f',fields2:2,fields3:'g'},
]

in this way i can easly get a list of "duplicate" that i can delete if needed, it's not really important get id 2 and 4 or 1 and 3

but 5 would never be included as it's not 'duplicated'

EDIT: sorry but i forgot to mention that there are some document with null value i need to exclude those

CodePudding user response：

This is the perfect use case of window field. You can use $setWindowFields to compute $rank in the grouping/partition you want. Then, get those rank not equal to 1 to get the duplicates.

db.collection.aggregate([
  {
    $match: {
      fields1: {
        $ne: null
      },
      fields2: {
        $ne: null
      },
      fields3: {
        $ne: null
      }
    }
  },
  {
    "$setWindowFields": {
      "partitionBy": {
        fields1: "$fields1",
        fields2: "$fields2",
        fields3: "$fields3"
      },
      "sortBy": {
        "_id": 1
      },
      "output": {
        "duplicateRank": {
          "$rank": {}
        }
      }
    }
  },
  {
    $match: {
      duplicateRank: {
        $ne: 1
      }
    }
  },
  {
    $unset: "duplicateRank"
  }
])

Mongo Playground

CodePudding user response：

I think you can try this aggregation query:

First group by the feilds you want to know if there are multiple values.
It creates an array with the _ids that are repeated.
Then get only where there is more than one ($match).
And last project to get the desired output. I've used the first _id found.

db.collection.aggregate([
  {
    "$group": {
      "_id": {
        "fields1": "$fields1",
        "fields2": "$fields2",
        "fields3": "$fields3"
      },
      "duplicatesIds": {
        "$push": "$_id"
      }
    }
  },
  {
    "$match": {
      "$expr": {
        "$gt": [
          {
            "$size": "$duplicatesIds"
          },
          1
        ]
      }
    }
  },
  {
    "$project": {
      "_id": {
        "$arrayElemAt": [
          "$duplicatesIds",
          0
        ]
      },
      "fields1": "$_id.fields1",
      "fields2": "$_id.fields3",
      "fields3": "$_id.fields2"
    }
  }
])

Example here