Home > other >  MongoDB - Best way to delete documents by query based on results of another query
MongoDB - Best way to delete documents by query based on results of another query

Time:10-28

I have a collection that can contain several million documents, for simplicity, lets say they look like this:

{'_id': '1', 'user_id': 1, 'event_type': 'a', 'name': 'x'}
{'_id': '2', 'user_id': 1, 'event_type': 'b', 'name': 'x'}
{'_id': '3', 'user_id': 1, 'event_type': 'c', 'name': 'x'}
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
{'_id': '8', 'user_id': 4, 'event_type': 'a', 'name': 'x'}
{'_id': '9', 'user_id': 4, 'event_type': 'b', 'name': 'x'}
{'_id': '10', 'user_id': 4, 'event_type': 'c', 'name': 'x'}

I want to have a daily job that runs and deletes all documents by user_id, if the user_id has a doc with event_type 'c'

So the resulting collection will be

{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}

I did it successfully with mongoshell like this

var cur = db.my_collection.find({'event_type': 'c'})
ids = [];
while (cur.hasNext()) {
  ids.push(cur.next()['user_id']);
  if (ids.length == 5){
    print('deleting for user_ids', ids);
    print(db.my_collection.deleteMany({user_id: {$in: ids}}));
    ids = [];
  }
}
if (ids.length){db.my_collection.deleteMany({user_id: {$in: ids}})}

Created a cursor to hold all docs with event_type 'c', grouped them into batches of 5 then deleted all docs with these ids.

It works but looks very slow, like each cur.next() only gets one doc at a time.

I wanted to know if there is a better or more correct way to achieve this, if it was elasticsearch I would create a sliced scroll, scan each slice in parallel and submit parallel deleteByQuery requests with 1000 ids each. Is something like this possible/preferable with mongo?

Scale wise I expect there to be several million docs (~10M) at the collection, 300K docs that match the query, and ~700K that should be deleted

CodePudding user response:

It sounds like you can just use deleteMany with the original query:

db.my_collection.deleteMany({
    event_type: 'c'
})

No size limitations on it, it might just take a couple of minutes to run depending on instance size.


EDIT:

I would personally try to use the distinct function, it's the cleanest and easiest code. distinct does have a 16mb limit about 300k~ unique ids a day (depending on userid field size) sounds a bit close to the threshold, or past it.

const userIds = db.my_collection.distinct('user_id', { event_type: 'c'});
db.my_collection.deleteMany({user_id: {$in: userIds}})

Assuming you except scale to increase, or this fails your tests then the best way is to use something similar to your approach, just in much larger batches. for example:

const batchSize = 100000;
const count = await db.my_collection.countDocuments({'event_type': 'c'});
let iteration = 0;

while (iteration * batchSize < count) {
    const batch = await db.my_collection.find({'event_type': 'c'}, { projection: { user_id: 1}}).limit(batchSize).toArray();
    if (batch.length === 0) {
        break
    }
    await db.my_collection.deleteMany({user_id: {$in: batch.map(v => v.user_id)}});
    iteration  
}
  • Related