Here is the query:
const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          {
            $match: {
              created_at: { $gt: prevSunday.toISOString() },
            },
          },
          {
            $group: {
              _id: "$url",
            },
          },
        ],
      },
    },
    {
      $match: {
        history: { $ne: [] },
      },
    },
    {
      $group: {
        _id: "$word",
        url: {
          $addToSet: "$history._id",
        },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$url",
              initialValue: [],
              in: {
                $concatArrays: ["$$value", "$$this"],
              },
            },
          },
        },
      },
    },
    {
      $sort: {
        count: -1,
      },
    },
    {
      $limit: 50,
    },
  ])
  .toArray();
I think I need an index, but I'm not sure how or where to add one.
CodePudding user response:
Perhaps performance of this operation should be revisited after we confirm that it is satisfying the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the `positive` collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.

To efficiently support the `$lookup` on the `search_history` collection, you may wish to confirm that an index on `{ search_id: 1, created_at: 1, url: 1 }` exists. Providing the `.explain("allPlansExecution")` output would allow us to better understand the current performance characteristics.
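For example, a minimal sketch with the Node.js driver (here `pipeline` is just a name introduced for the array of stages shown in the question):

// Hedged sketch: create the suggested index and inspect the query plan.
// "pipeline" stands for the aggregation stages from the question.
await mongo
  .collection("search_history")
  .createIndex({ search_id: 1, created_at: 1, url: 1 });

const plan = await mongo
  .collection("positive")
  .aggregate(pipeline)
  .explain("allPlansExecution");
console.log(JSON.stringify(plan, null, 2));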
Desired Logic
Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:
- For every single document in the `positive` collection, add a new field called `history`.
- This new field is a list of `url` values from the `search_history` collection where the corresponding document has a matching `search_id` value and was `created_at` after last Sunday.
- The aggregation then filters to only keep documents where the new `history` field has at least one entry.
- The next stage then groups the results together by `word`. The `$addToSet` operator is used here, but it may be generating an array of arrays rather than de-duplicated `url`s.
- The final 3 stages of the aggregation seem to be focused on calculating the number of `url`s and returning the top 50 results by `word`, sorted on that size in descending order.
Is this what you want? In particular, the following aspects may be worth confirming:
- Is it your intention to process every document in the `positive` collection? This may be the case, but it's impossible to tell without any schema/use-case context.
- Is the size calculation of the `url`s correct? It seems like you may need to use a `$map` when doing the `$addToSet` for the `$group` instead of using `$reduce` for the subsequent `$project` (one possible sketch follows this list).
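On that second point, here is one possible (untested) sketch of a de-duplicating count. It keeps the existing `$group` stage and only swaps the `$concatArrays` in the `$project` for `$setUnion`; that substitution is my own illustration, not necessarily what was intended above:

// Replaces only the "count" expression from the $project stage in the question.
{
  $project: {
    _id: 0,
    word: "$_id",
    count: {
      $size: {
        $reduce: {
          input: "$url",
          initialValue: [],
          // $setUnion keeps each url at most once across all merged arrays
          in: { $setUnion: ["$$value", "$$this"] },
        },
      },
    },
  },
}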
CodePudding user response:
The best thing to do is to limit the number of documents passed to each stage. MongoDB uses indexes in an aggregation only in the first stage, and only if that stage is a `$match`, using at most one index.
So the best thing to do is to have a match on an indexed field that is very restrictive.
Moreover, please note that `$limit`, `$skip` and `$sample` are not panaceas, because they still scan the entire collection.
A way to efficiently limit the number of documents selected in the first stage is to use "pagination". You can make it work like this:
Once every X requests
- Count the number of docs in the collection
- Divide this into chunks of at most Y docs
- Find the `_id`s of the docs at positions Y, 2Y, 3Y, etc. with skip and limit
- Cache the results in redis/memcache (or as a global variable if you really cannot do otherwise); a sketch of this setup pass follows this list
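A minimal sketch of that setup pass, assuming the Node.js MongoDB driver and ioredis; the chunk size Y, the collection name `positive`, and the function name are all illustrative:

import Redis from "ioredis";
import type { Db } from "mongodb";

const redis = new Redis();

// Recompute the chunk boundaries; run this only once every X requests.
async function rebuildChunks(mongo: Db, Y: number): Promise<void> {
  const coll = mongo.collection("positive");
  const total = await coll.countDocuments();           // 1) count the docs
  const nbChunks = Math.max(1, Math.ceil(total / Y));  // 2) split into chunks of Y

  for (let i = 1; i < nbChunks; i++) {
    // 3) _id of the document at position i * Y (served by the default _id index)
    const [doc] = await coll
      .find({}, { projection: { _id: 1 } })
      .sort({ _id: 1 })
      .skip(i * Y)
      .limit(1)
      .toArray();
    if (doc) await redis.set(`id:${i}`, String(doc._id)); // 4) cache the boundary
  }
  await redis.set("nbChunks", String(nbChunks));
  await redis.set("used", "0");
}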
Every request
- Get the current chunk to scan by reading the redis keys `used` and `nbChunks`
- Get the `_id`s cached in redis used to delimit the next aggregation, `id:${used % nbChunks}` and `id:${(used % nbChunks) + 1}` respectively
- Aggregate using `$match` with `_id: { $gte: ObjectId(id0), $lt: ObjectId(id1) }`
- Increment `used`; if `used > X` then update the chunks (a sketch of this request path follows this list)
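And a sketch of the per-request path under the same assumptions; `restOfPipeline` is just a name for the aggregation stages from the question, and `redis` and `rebuildChunks` come from the setup sketch above:

import { ObjectId, type Db, type Document } from "mongodb";

// Run the aggregation against the current chunk only, then advance the cursor.
async function aggregateNextChunk(
  mongo: Db,
  restOfPipeline: Document[],
  X: number,
  Y: number
): Promise<Document[]> {
  const used = Number(await redis.get("used")) || 0;
  const nbChunks = Number(await redis.get("nbChunks")) || 1;
  const i = used % nbChunks;

  const id0 = await redis.get(`id:${i}`);     // absent for the first chunk
  const id1 = await redis.get(`id:${i + 1}`); // absent for the last chunk (no $lt)

  const range: Record<string, ObjectId> = {};
  if (id0) range.$gte = new ObjectId(id0);
  if (id1) range.$lt = new ObjectId(id1);
  const matchStage = Object.keys(range).length ? [{ $match: { _id: range } }] : [];

  const results = await mongo
    .collection("positive")
    .aggregate([...matchStage, ...restOfPipeline])
    .toArray();

  await redis.set("used", String(used + 1));
  if (used + 1 > X) await rebuildChunks(mongo, Y); // refresh the chunk boundaries
  return results;
}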
Further optimisation
If using redis, prefix every key with `${cluster.worker.id}:` to avoid hot keys.
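For instance, assuming Node's cluster module and the same ioredis client as above (`cluster.worker` is only defined inside worker processes):

import cluster from "node:cluster";

// Prefix every key with the worker id so each worker tracks its own chunk cursor.
const prefix = cluster.isWorker ? `${cluster.worker!.id}:` : "";
await redis.set(`${prefix}used`, "0");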
Notes
- Step 3 of the chunk setup can be a really long and intensive process, so do it only when necessary, let's say every X ~ 1k requests.
- If you are scanning the last chunk, do not put the `$lt`
- Once this process is implemented, your job is to find the sweet spot of X and Y that suits your needs: Y should be large enough to retrieve as many documents as possible without each aggregation taking too long, and X should keep the chunks roughly equal as the collection grows.
- This process is a bit long to implement, but once it is in place, time complexity is ~O(Y) rather than ~O(N). Indeed, with the `$match` as the first stage and `_id` being an indexed field, this first stage is really fast and limits the scan to at most Y documents.
Hope it helps =) Make sure to ask more if needed =)