I have a collection that can contain several million documents. For simplicity, let's say they look like this:
{'_id': '1', 'user_id': 1, 'event_type': 'a', 'name': 'x'}
{'_id': '2', 'user_id': 1, 'event_type': 'b', 'name': 'x'}
{'_id': '3', 'user_id': 1, 'event_type': 'c', 'name': 'x'}
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
{'_id': '8', 'user_id': 4, 'event_type': 'a', 'name': 'x'}
{'_id': '9', 'user_id': 4, 'event_type': 'b', 'name': 'x'}
{'_id': '10', 'user_id': 4, 'event_type': 'c', 'name': 'x'}
I want to have a daily job that runs and deletes all documents by user_id if that user_id has a doc with event_type 'c'.
So the resulting collection will be
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
I did this successfully in the mongo shell like this:
var cur = db.my_collection.find({'event_type': 'c'});
var ids = [];
while (cur.hasNext()) {
    ids.push(cur.next()['user_id']);
    if (ids.length == 5) {
        print('deleting for user_ids', ids);
        print(db.my_collection.deleteMany({user_id: {$in: ids}}));
        ids = [];
    }
}
if (ids.length) {
    db.my_collection.deleteMany({user_id: {$in: ids}});
}
I created a cursor over all docs with event_type 'c', grouped the user_ids into batches of 5, then deleted all docs with those ids. It works, but it seems very slow, as if each cur.next() only fetches one doc at a time.
I wanted to know if there is a better or more correct way to achieve this. If this were Elasticsearch, I would create a sliced scroll, scan each slice in parallel, and submit parallel deleteByQuery requests with 1000 ids each. Is something like this possible/preferable with Mongo?
Scale-wise, I expect several million docs (~10M) in the collection, ~300K docs that match the query, and ~700K that should be deleted.
CodePudding user response:
It sounds like you can just use deleteMany with the original query:
db.my_collection.deleteMany({
    event_type: 'c'
})
There are no size limitations on it; it might just take a couple of minutes to run, depending on instance size.
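If the collection isn't already indexed on event_type, it's probably also worth creating an index first so the filter doesn't have to scan all ~10M docs; something like:

// Assuming no index on event_type exists yet; a single-field index
// lets the event_type filter avoid a full collection scan.
db.my_collection.createIndex({ event_type: 1 })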
EDIT:
I would personally try to use the distinct function; it's the cleanest and easiest code. Note, though, that distinct has a 16MB limit on its result, and ~300K unique ids a day (depending on the size of the user_id field) sounds a bit close to that threshold, or past it.
const userIds = db.my_collection.distinct('user_id', { event_type: 'c'});
db.my_collection.deleteMany({user_id: {$in: userIds}})
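If the distinct result does get near that 16MB limit, a rough sketch of an alternative (not something I've benchmarked here) is to stream the unique user_ids through an aggregation cursor and delete in batches; the batch size of 1000 below is arbitrary:

// Sketch: stream unique user_ids instead of building one big distinct array,
// deleting in batches so no single $in list grows too large.
const cursor = db.my_collection.aggregate([
    { $match: { event_type: 'c' } },
    { $group: { _id: '$user_id' } }
]);

let ids = [];
cursor.forEach(doc => {
    ids.push(doc._id);
    if (ids.length === 1000) {
        db.my_collection.deleteMany({ user_id: { $in: ids } });
        ids = [];
    }
});
if (ids.length) {
    db.my_collection.deleteMany({ user_id: { $in: ids } });
}

This keeps memory bounded on both the client and the server, at the cost of more deleteMany round trips.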
Assuming you expect scale to increase, or the distinct approach fails your tests, then the best way is to use something similar to your own approach, just in much larger batches. For example:
const batchSize = 100000;
const count = await db.my_collection.countDocuments({ event_type: 'c' });
let iteration = 0;

while (iteration * batchSize < count) {
    // Each deleteMany also removes the matched event_type 'c' docs,
    // so the next find() returns a fresh batch of user_ids.
    const batch = await db.my_collection
        .find({ event_type: 'c' }, { projection: { user_id: 1 } })
        .limit(batchSize)
        .toArray();
    if (batch.length === 0) {
        break;
    }
    await db.my_collection.deleteMany({ user_id: { $in: batch.map(v => v.user_id) } });
    iteration++;
}