I have a collection with a large number of documents (~100 million). Given a list of ids, I want to find the ones that have no matching document in the collection. Example:
query_user_ids = ["32432", "32433", "32434", "32435"]
document = {"_id": "xxxx", "user_id": "32433", "details": "xxxx"}
user_id has a unique index on it.
I want to query which user_ids are not present in the collection. So assuming user_id 32434 and 32435 do not exist, when I query the collection with this list of ids, I should get the response ["32434", "32435"]
Right now, I am just looping over the user_ids and calling find_one for each one to check whether a document with that user_id exists, but I suspect this is slowing down the operation. Is there a way to do this by passing in the whole list of ids directly?
I am using PyMongo for querying.
CodePudding user response:
Query
- you can send a single find with $in, like below
- this returns all the documents that exist
- then in Python, for example with a loop over the list, keep the ids from the list that are not in the results
- this way you send only 1 query, and the Python-side work is cheap even if the list is fairly big
*I am not sure this is the optimal way, but I think it will be faster than sending many queries. Try it if you can and give some feedback.
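# "collection" is assumed to be your PyMongo Collection object
collection.find({"user_id": {"$in": ["32432", "32433", "32434", "32435"]}},
                {"_id": 0, "user_id": 1})  # in PyMongo the second argument is the projection itself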
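To make the "keep those that are not in the results" step concrete, here is a minimal sketch of the whole approach. The function name find_missing_user_ids and the batch_size parameter are my own and not part of PyMongo; the batching is an assumption on my side, added because a single $in with a very large list could run into MongoDB's document size limit. It only relies on collection.find(filter, projection), so it should work with a normal PyMongo collection, but I have not benchmarked it against 100 million documents.

```python
from typing import Iterable, List


def find_missing_user_ids(collection, user_ids: Iterable[str],
                          batch_size: int = 10_000) -> List[str]:
    """Return the ids from user_ids that have no document in the collection.

    `collection` is assumed to be a PyMongo Collection (anything with a
    compatible find(filter, projection) method works).
    `batch_size` is a hypothetical tuning knob, not a PyMongo setting.
    """
    # De-duplicate the input while keeping its original order.
    ids = list(dict.fromkeys(user_ids))

    found = set()
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One query per batch; only user_id is returned, _id is excluded.
        cursor = collection.find(
            {"user_id": {"$in": batch}},
            {"_id": 0, "user_id": 1},
        )
        found.update(doc["user_id"] for doc in cursor)

    # Set membership is O(1), so this pass is cheap even for long lists.
    return [uid for uid in ids if uid not in found]
```

You would call it as find_missing_user_ids(collection, query_user_ids). Collecting the returned ids into a set first is what keeps the Python side cheap; checking each id against a plain list of results would be much slower for long lists.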