Home > Net >  MongoDB: Aggregation ($sort) on a union of collections very slow
MongoDB: Aggregation ($sort) on a union of collections very slow

Time:07-19

I have a few collections where I need to perform a union on, then query. However, I realise this is very slow for some reason. The explain is not that helpful as it only tells if the 1st $match stage is indexed. I am using a pipeline like:

[
    {
        "$match": {
            "$and": [
                { ... }
            ]
        }
    },

    // repeat this chunk for each collection
    {
        "$unionWith": {
            "coll": "anotherCollection",
            "pipeline": [
                {
                    "$match": {
                        "$and": [
                            { ... }
                        ]
                    }
                },
            ]
        }
    },

    // Then an overall limit/handle pagination for all the unioned results
    // UPDATE: Realised the sort is the culprit
    { "$sort": { "createdAt": -1 } },
    { "$skip": 0},
    { "$limit": 50 }
]

Is there a better way to do such a query? Does mongo do the unions in parallel maybe? Is there a "DB View" I can use to obtain a union of all the collections?

UPDATE: Just realised the runtime increase once I add the sort. I suspect it cannot use indexes because its on a union?

CodePudding user response:

Yes, there is a way. But it's not that trivial, you need to change how you do pagination. It requires more engineering, as you got to keep track of the page not only by number, but also by last elements found

If you paginate by filtering by a unique identifier (usually _id) with a cursor you can do early filtering.

!!! Important !!! You will need to keep track of the last item found instead of skipping a number of elements. If you don't do so, you will lose track of the pagination, and maybe never return some data, or return some twice, which is way worse than being slow

[
    {
        "$match": {
            "$and": [
                { ... }
            ],
            "_id":{"$gt": lastKnownIdOfCollectionA} // this will filter out everything you already saw, so no skip needed
        }
    },
    { "$sort": { "createdAt": -1 } }, // this sorting is indexed!
    { "$limit": 50 } // maybe you will take 0 but max 50, you don't care about the rest
    // repeat this chunk for each collection
    {
        "$unionWith": {
            "coll": "anotherCollection",
            "pipeline": [
                {
                    "$match": {
                        "$and": [
                            { ... }
                        ],
                        "_id":{"$gt": lastKnownIdOfCollectionB} // this will filter out everything you already saw, so no skip needed
                    }
                },
                { "$sort": { "createdAt": -1 } }, // this sorting is indexed!
                { "$limit": 50 } // maybe you will take 0 but max 50, you don't care about the rest
            ]
        }
    },

    // At this point you have MAX 100 elements, an index is not needed for sorting :)
    { "$sort": { "createdAt": -1 } },
    { "$skip": 0},
    { "$limit": 50 }
]

In this example, I do the early filter by _id which also contains the createdAt timestamp. If the filtering is not about the creation date you might have to define which identifier will suit the most. Remember the identifier must be a unique identifier, but you can use more than one value combined (eg. createdAt randomizedId)

CodePudding user response:

No, there is no way.

There are views, but they won't help with performance. They are merely a syntax sugar to save the aggregation query on database side. They will run exactly the same search down the road.

There are materialised views - essentially collections with saved results. They will work super fast, with a caveat of very eventual consistency, and only if you don't forget to refresh them. There are only few very niche cases I can imagine where materialised views could be helpful.

You will bee way better with storing all documents in a single collection at the first place, if you aim to optimise such queries.

  • Related