const result = await sandbox
.aggregate([
{
$match: {
B: { $eq: 500 },
// B: { $lt: 500 }, faster
// B: { $gt: 500 }, faster
},
},
{ $limit: 1000 },
])
B is an un-indexed numeric field in a collection of 1.5 million documents, with values distributed randomly from 1 to 1000.
The queries for $gt and $lt are about 5 - 10 times faster than $eq. I have tested this repeatedly.
What explains this difference?
CodePudding user response:
The distribution of the data and the (total) number of documents that match the queries.
I believe the behavior that you are describing is expected given the scenario that was outlined. If anything, I would expect the performance difference to actually be larger than the one that was mentioned.
As there is no index for the database to utilize for this operation, it will be forced to do a collection scan. Now let's consider what that collection scan will look like during its execution. With 1.5 million documents and 1,000 distinct (and random) values, we can expect that there are approximately 1,500 documents which contain each individual value. Based on this we would expect about 1,500 total documents in the collection to match the query with the equality condition but about half of the documents (750,000) to match the other queries with the range conditions.
Assuming a perfectly random distribution where each value showed up exactly once per 1,000 documents, the following would happen when the database began scanning the collection for each of the queries:
- Equality - each document that the database scans will have a 0.1% chance (1/1000) of matching the filter. As the query is requesting a result set size of 1,000 documents, we can expect the database to have to scan result_set_size / chance_of_matching = 1,000 / .001 = 1 million documents (a full two-thirds of the collection) before it completes.
- Range - each document that the database scans will have about a 50% chance (500/1000) of matching the filter. Using the equation from before, we now expect the database to scan result_set_size / chance_of_matching = 1,000 / .5 = 2,000 documents before it completes.
That means that the database is potentially doing roughly 500 times (nearly three orders of magnitude) more work to satisfy the equality query than the range query. That is why I'm surprised the observed difference is only 5 - 10x.
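If it helps, here is that expectation written out as a quick back-of-the-envelope calculation in plain JavaScript (the helper name is just for illustration):
// With randomly distributed values, the expected number of documents
// scanned before the limit is filled is result_set_size / chance_of_matching.
var resultSetSize = 1000;
var expectedDocsScanned = function (chanceOfMatching) {
    return resultSetSize / chanceOfMatching;
};
expectedDocsScanned(1 / 1000);   // equality: 1,000,000 documents
expectedDocsScanned(500 / 1000); // range:        2,000 documents
// 1,000,000 / 2,000 = 500x more scanning for the equality query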
Demonstration
I can't create enough documents on mongoplayground to appropriately demonstrate, so I did the following in my local environment. First I created the collection via:
// Insert 1.5 million documents whose B value is a random integer from 0 to 999
var bulk = db.foo.initializeUnorderedBulkOp();
for (i = 0; i < 1500000; i++) { bulk.insert({ _id: i, B: Math.floor(Math.random() * 1000) }); }
bulk.execute();
This resulted in the following distribution of data:
> db.foo.count({B:500})
1592
> db.foo.count({B:{$gt:500}})
747873
> db.foo.count({B:{$lt:500}})
750535
Then we can use explain() with executionStats to determine the amount of work that the database had to do to execute these queries. For the equality:
> db.foo.aggregate([ { $match: { B: { $eq: 500 } } }, { $limit: 1000 }]).explain("executionStats").executionStats
{
nReturned: 1000,
executionTimeMillis: 332,
totalKeysExamined: 0,
totalDocsExamined: 915535,
...
And for the range queries:
> db.foo.aggregate([ { $match: { B: { $gt: 500 } } }, { $limit: 1000 }]).explain("executionStats").executionStats
{
nReturned: 1000,
executionTimeMillis: 1,
totalKeysExamined: 0,
totalDocsExamined: 2047,
...
> db.foo.aggregate([ { $match: { B: { $lt: 500 } } }, { $limit: 1000 }]).explain("executionStats").executionStats
{
nReturned: 1000,
executionTimeMillis: 1,
totalKeysExamined: 0,
totalDocsExamined: 1957,
...
All of these numbers are right in line with what we were expecting based on the earlier musings.
Aside
By the way, this configuration helps demonstrate why indexes are most helpful when retrieving a small subset of results. If you were to add an index on the B field, I would expect both of the following things to happen:
- The performance of the equality query to improve substantially.
- The performance of the range queries to be worse than when no index is present.
There is a cost associated with traversing an index prior to retrieving the associated data. It is often the case that a query which matches (very roughly) about 20% or more of the collection will perform better by scanning the full collection rather than using an index.
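If you want to check that aside yourself against the same foo collection, something along these lines should work (I'm not reproducing the output here, since the exact numbers will vary from run to run):
db.foo.createIndex({ B: 1 })
// Compare totalKeysExamined / totalDocsExamined and executionTimeMillis
// against the collection-scan numbers above:
db.foo.aggregate([{ $match: { B: { $eq: 500 } } }, { $limit: 1000 }]).explain("executionStats").executionStats
db.foo.aggregate([{ $match: { B: { $gt: 500 } } }, { $limit: 1000 }]).explain("executionStats").executionStats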