I have a collection A with N documents. My collection looks like this:
{
"_id": "61721b17e52d6033c444059d",
"advertising_venue": "GAP Store, 1440 W Taylor st",
"ad_shelf_name": "11",
"gender": "man",
"age": "25-35",
"distance_to_shelf": "7.035805",
"date": "October 21st 2021 8:59:51 pm",
"user_id": "0.14136775694578052"
},
{
"_id": "61721b18e52d6033c444059e",
"advertising_venue": "GAP Store, 1440 W Taylor st",
"ad_shelf_name": "11",
"gender": "man",
"age": "25-35",
"distance_to_shelf": "8.065434999999999",
"date": "October 21st 2021 8:59:52 pm",
"user_id": "0.14136775694578052"
},
{
"_id": "61721b19e52d6033c444059f",
"advertising_venue": "GAP Store, 1440 W Taylor st",
"ad_shelf_name": "11",
"gender": "man",
"age": "25-35",
"distance_to_shelf": "10.124695",
"date": "October 21st 2021 8:59:53 pm",
"user_id": "0.14136775694578052"
}
I want to compare the documents by their user_id value: if two documents have a similar user_id, I want to remove one of them; otherwise they should stay in the collection as they are.
Is it possible to do this in MongoDB?
CodePudding user response:
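This can be achieved by creating a unique index on user_id with dropDups: true. Note that the dropDups option was removed in MongoDB 3.0, so this only works on older versions.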
db.collection.ensureIndex({user_id: 1}, {unique: true, dropDups: true})
CodePudding user response:
When you say "if it is similar", this has a particular meaning when talking about strings. If you want to delete all documents with identical user_id fields, that can be done.
If you want to delete all documents with almost the same, but slightly different, user_id, then no, that cannot be done with MongoDB directly, and you will have to solve that another way.
Assuming you want to delete documents with identical user_id fields, you may want to consider which document you want to keep and which ones you want to delete.
Assuming you want to keep only the first copy of each, on MongoDB versions before 3.0 you can do so by creating a unique index on the user_id field with the option dropDups set to true. MongoDB will then scan the collection on disk and index each user_id. As it comes across any documents which are duplicates, it will delete them. The dropDups option was removed in MongoDB 3.0, so on newer versions building a unique index over existing duplicates fails with a duplicate key error instead.
db.mycollection.ensureIndex({'user_id' : 1}, {unique : true, dropDups : true})
However, if you want to delete documents based on some other kind of logic, say you want to keep the newest document, or maybe the document with the lowest distance_to_shelf, you will need to first query your data, sorting by the criteria which make certain records more valuable, and then delete all documents with identical user_id fields that do not have the same _id.
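As a rough sketch of that last approach, the selection step could be written in plain JavaScript along these lines. The names duplicateIdsToDelete and closerToShelf are made up for illustration, the collection name mycollection is taken from the example above, and the criterion (lowest distance_to_shelf wins) is just one possible choice. It also assumes the whole collection fits in memory.

```javascript
// Return the _id values of every document that should be deleted so that
// only one document per user_id remains. `isBetter(a, b)` (hypothetical
// callback) returns true when `a` should be kept in preference to `b`.
function duplicateIdsToDelete(docs, isBetter) {
  const keep = new Map(); // user_id -> best document seen so far
  const toDelete = [];
  for (const doc of docs) {
    const current = keep.get(doc.user_id);
    if (current === undefined) {
      // First document seen for this user_id.
      keep.set(doc.user_id, doc);
    } else if (isBetter(doc, current)) {
      // The new document wins; the previously kept one becomes a duplicate.
      toDelete.push(current._id);
      keep.set(doc.user_id, doc);
    } else {
      toDelete.push(doc._id);
    }
  }
  return toDelete;
}

// Example criterion: keep the document with the lowest distance_to_shelf.
// The field is stored as a string in the sample data, so convert it first.
const closerToShelf = (a, b) =>
  Number(a.distance_to_shelf) < Number(b.distance_to_shelf);

// In the mongo shell this could be used roughly as follows:
//   const docs = db.mycollection.find().toArray();
//   const ids = duplicateIdsToDelete(docs, closerToShelf);
//   db.mycollection.deleteMany({ _id: { $in: ids } });
```

Once the duplicates are gone, a plain unique index, db.mycollection.createIndex({user_id: 1}, {unique: true}), should prevent new ones from being inserted.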