MongoDB: inefficient collection / index scan (vs. efficient query using PostgreSQL)


I have a collection with 10M records, with the following schema:

{
   "_id": ObjectId,
   "uid": string
}

uid is a random string for testing purposes, generated with unique values.

There are two indexes: {"_id": 1} (created by default) and {"uid": 1}.
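For anyone who wants to reproduce this, a collection like the above can be generated with something along these lines in mongosh (a sketch; the exact uid-generation method is an assumption, any unique random strings will do):

```javascript
// Sketch: bulk-insert 10M documents with unique random uid strings
let batch = [];
for (let i = 0; i < 10_000_000; i++) {
  // i.toString(36) guarantees uniqueness; the random suffix makes values unordered
  batch.push({ uid: Math.random().toString(36).slice(2) + "-" + i.toString(36) });
  if (batch.length === 100_000) {
    db.test.insertMany(batch, { ordered: false });
    batch = [];
  }
}
db.test.createIndex({ uid: 1 }); // the {_id: 1} index exists by default
```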

The following two queries produce very inefficient query plans, which scan/fetch all 10M documents and take several seconds to run:

db.test.find({}).sort({"uid": 1, "_id": 1}).limit(10).explain()

(...)
    "queryPlanner" : {
        "namespace" : "test.test",
        "indexFilterSet" : false,
        "parsedQuery" : {

        },
        "queryHash" : "EEBF6C62",
        "planCacheKey" : "BC5E0BD8",
        "maxIndexedOrSolutionsReached" : false,
        "maxIndexedAndSolutionsReached" : false,
        "maxScansToExplodeReached" : false,
        "winningPlan" : {
            "stage" : "SORT",
            "sortPattern" : {
                "uid" : 1,
                "_id" : 1
            },
            "memLimit" : 104857600,
            "limitAmount" : 10,
            "type" : "simple",
            "inputStage" : {
                "stage" : "COLLSCAN",
                "direction" : "forward"
            }
        },
        "rejectedPlans" : [ ]
    },
(...)
db.test.find({"uid": {"$gt": "a"}}).sort({"uid": 1, "_id": 1}).limit(10).explain()
(...)
    "queryPlanner" : {
        "namespace" : "test.test",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "uid" : {
                "$gt" : "a"
            }
        },
        "queryHash" : "F4847212",
        "planCacheKey" : "8F79D766",
        "maxIndexedOrSolutionsReached" : false,
        "maxIndexedAndSolutionsReached" : false,
        "maxScansToExplodeReached" : false,
        "winningPlan" : {
            "stage" : "SORT",
            "sortPattern" : {
                "uid" : 1,
                "_id" : 1
            },
            "memLimit" : 104857600,
            "limitAmount" : 10,
            "type" : "simple",
            "inputStage" : {
                "stage" : "FETCH",
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "uid" : 1
                    },
                    "indexName" : "uid_1",
                    "isMultiKey" : false,
                    "multiKeyPaths" : {
                        "uid" : [ ]
                    },
                    "isUnique" : false,
                    "isSparse" : false,
                    "isPartial" : false,
                    "indexVersion" : 2,
                    "direction" : "forward",
                    "indexBounds" : {
                        "uid" : [
                            "(\"a\", {})"
                        ]
                    }
                }
            }
        },
        "rejectedPlans" : [ ]
    },
(...)

The analogous data model and queries in PostgreSQL produce very efficient query plans that run in under 1 ms.

explain (analyze, buffers)
select * from test
order by uid, id
limit 10;

Limit  (cost=0.65..1.87 rows=10 width=69) (actual time=0.477..0.480 rows=10 loops=1)
  Buffers: shared hit=15
  ->  Incremental Sort  (cost=0.65..1221670.34 rows=10000000 width=69) (actual time=0.477..0.478 rows=10 loops=1)
        Sort Key: uid, id
        Presorted Key: uid
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=15
        ->  Index Scan using uid on test  (cost=0.56..771670.34 rows=10000000 width=69) (actual time=0.085..0.190 rows=11 loops=1)
              Buffers: shared hit=15
Planning Time: 0.400 ms
Execution Time: 0.836 ms
explain (analyze, buffers)
select * from test
where uid > 'a'
order by uid, id
limit 10;

Limit  (cost=0.65..1.90 rows=10 width=69) (actual time=0.289..0.292 rows=10 loops=1)
  Buffers: shared hit=15
  ->  Incremental Sort  (cost=0.65..1246670.34 rows=10000000 width=69) (actual time=0.287..0.288 rows=10 loops=1)
        Sort Key: uid, id
        Presorted Key: uid
        Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
        Buffers: shared hit=15
        ->  Index Scan using uid on test  (cost=0.56..796670.34 rows=10000000 width=69) (actual time=0.062..0.229 rows=11 loops=1)
              Index Cond: ((uid)::text > 'a'::text)
              Buffers: shared hit=15
Planning Time: 4.809 ms
Execution Time: 0.347 ms

Intuitively, since there is an index on uid, it should be straightforward to fetch the first N rows already sorted by uid, and then sort by _id only within each group of equal uid values (which in this example is a no-op, since uid is unique). This appears to be exactly what PostgreSQL is doing.

Are the inefficient query plans just a limitation on the MongoDB query planner, or am I doing something wrong?

CodePudding user response:

I am not exactly sure whether to call it a limitation, but in the first case, for the query:

db.test.find({}).sort({"uid": 1, "_id": 1}).limit(10).explain()

MongoDB doesn't use any index, because the filter object is empty and no existing index matches the requested sort order {"uid": 1, "_id": 1}; that's why you see the COLLSCAN stage. If you instead run either of these queries:

db.test.find({}).sort({"uid": 1}).limit(10).explain()

OR

db.test.find({}).sort({"_id": 1}).limit(10).explain()

You'll see the IXSCAN stage come into play, because indexes exist that can satisfy each of those single-field sort orders directly. To bring the uid index into play for the original query, you can add a filter such as:

 db.test.find({"uid": {"$exists": true}}).sort({"uid": 1, "_id": 1}).limit(10).explain()

(though, as your second explain output shows, a blocking SORT stage remains, because the single-field { uid: 1 } index alone cannot provide the compound { uid: 1, _id: 1 } order).

For the second query,

db.test.find({"uid": {"$gt": "a"}}).sort({"uid": 1, "_id": 1}).limit(10).explain()

MongoDB is using the uid index to filter the documents, but unlike PostgreSQL it does not sort incrementally: since the { uid: 1 } index does not include _id, every matching document is fetched and passed through a blocking SORT stage.

In most cases, if you need to sort by both uid and _id, the best option is to create a compound index covering both fields, like this:

db.test.createIndex({uid: 1, _id: 1})

After creating this index, both of your queries will use it, whether or not a filter is present.
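With that compound index in place, the explain output for the first query should look roughly like the following (an abbreviated, illustrative plan shape, not verbatim output; details vary by MongoDB version):

```javascript
// Note: no SORT stage — the index provides the {uid: 1, _id: 1} order,
// so only 10 index entries are scanned and fetched.
"winningPlan" : {
    "stage" : "LIMIT",
    "limitAmount" : 10,
    "inputStage" : {
        "stage" : "FETCH",
        "inputStage" : {
            "stage" : "IXSCAN",
            "keyPattern" : { "uid" : 1, "_id" : 1 },
            "indexName" : "uid_1__id_1",
            "direction" : "forward"
        }
    }
}
```

The key difference from your plans above is the absence of the blocking SORT stage.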

CodePudding user response:

The fact that MongoDB does not perform an incremental sort like PostgreSQL appears to be a limitation of its query planner; you are not doing anything directly wrong in that regard. There are still two observations/recommendations that can be made here.

Compound Index

As @Charchit Kapoor mentions, a compound index on {uid: 1, _id: 1} will allow MongoDB to efficiently perform this query as written (and avoid a blocking sort altogether).

Use a Single Field

You mention the following:

uid is a random string for testing purposes, generated with unique values.

This leads to two questions:

  1. Why are you using two separate unique fields (as opposed to, for example, just using the string value directly as your _id value)?
  2. Assuming that you want both, why are you performing a compound sort?

Looking at the latter question, if the uid field is unique then appending _id to the sort does not change the results in any way. There is only a single document with a given uid value so there is no additional ordering that the trailing _id field provides.

Assuming that you want to keep both fields, then simply dropping _id from the sort should keep the result set the same while allowing MongoDB to efficiently use the { uid: 1 } index to perform the operation.
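For example, both of the following should satisfy the sort from the existing { uid: 1 } index without a blocking sort, and return the same documents as your original queries (a sketch against the collection described in the question):

```javascript
// Sort by uid only — the trailing _id added nothing, since uid is unique
db.test.find({}).sort({ "uid": 1 }).limit(10)
db.test.find({ "uid": { "$gt": "a" } }).sort({ "uid": 1 }).limit(10)
```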
