I am using mongoose to query a really big list from Mongodb
const chat_list = await chat_model.find({}).sort({uuid: 1}); // uuid is a index
const msg_list = await message_model.find({}, {content: 1, xxx}).sort({create_time: 1});// create_time is a index of message collection, time: t1
// chat_list length is around 2,000, msg_list length is around 90,000
compute(chat_list, msg_list); // time: t2
function compute(chat_list, msg_list) {
for (let i = 0, len = chat_list.length; i < len; i ) {
msg_list.filter(msg => msg.uuid === chat_list[i].uuid)
// consistent handling for every message
}
}
for above code, t1 is about 46s, t2 is about 150s
t2 is really to big, so weird.
then I cached these list to local json file,
const chat_list = require('./chat-list.json');
const msg_list = require('./msg-list.json');
compute(chat_list, msg_list); // time: t2
this time, t2 is around 10s.
so, here comes the question, 150 seconds vs 10 seconds, why? what happened?
I tried to use worker to do the compute step after mongo query, but the time is still much bigger than 10s
CodePudding user response:
The mongodb query returns a FindCursor that includes arrayish methods like .filter()
but the result is not an Array.
Use .toArray()
on the cursor before filtering to process the mongodb result set like for like. That might not make the overall process any faster, as the result set still needs to be fetched from mongodb, but compute will be similar.
const chat_list = await chat_model
.find({})
.sort({uuid: 1})
.toArray()
const msg_list = await message_model
.find({}, {content: 1, xxx})
.sort({create_time: 1})
.toArray()
CodePudding user response:
Matt typed faster than I did, so some of what was suggested aligns with part of this answer.
I think you are measuring and comparing something different than what you are expecting and implying.
Your expectation is that the compute()
function takes around 10 seconds once all of the data is loaded by the application. This is (mostly) demonstrated by your second test, apart from the fact that that test includes the time it takes to load the data from the local files. But you're seeing that there is a difference of 104
seconds (150 - 46
) between the completion of message_model.find()
and compute()
hence leading to the question.
The key thing is that successfully advancing from the find
against message_model
is not the same thing as retrieving all of the results. As @Matt notes, the find()
will return with a cursor object once the initial batch of results are ready. That is very different than retrieving all of the results. So there is more work (apparently ~94 seconds worth) left to do from the two find()
operations to further iterate the cursors and retrieve the rest of the results. This additional time is getting reported inside of t2
.
Ass suggested by @Matt, calling .toArray()
should shift that time back into t1
as you are expecting. Also sounds like it may be more correct due to ambiguity with .filter()
functions.
There are two other things that catch my attention. The first is: why are you retrieving all of this data client-side to do the filtering there? Perhaps you would like to do this uuid
matching inside of the database via $lookup
?
Secondly, this comment isn't clear to me:
// create_time is a index of message collection, time: t1
create_time
itself is a field here, existent or not, that you are requesting an ascending sort against.