I have a large MongoDB collection (approx. 30M records) and each item has an array of unique numbers consisting of 8 digits. Most of the time that array has only 1 element (1 number). I want to find out how many records the collection holds with numbers starting with 4, for example, so I query:
{ "numbers.number": /^4.*/i }
However, the query takes too long; last time it ran for 20 minutes before I interrupted it. Is there a way to optimize the query? numbers.number is indexed. I also tried this one:
{ "numbers.number": /^4[0-9]{7}/}
It still takes too long. Here's an example document:
{
"_id" : ObjectId("some_id"),
"created_at" : ISODate("2022-10-13T09:32:45.000+0000"),
"source" : {
"created_at" : ISODate("2021-10-13T08:54:06.000+0000"),
"some_id" : NumberInt(234),
"another_id" : NumberInt(11)
},
"first_name" : "Test",
"last_name" : "Test",
"date_of_birth" : "1970-01-01",
"status" : "active",
"numbers" : [
{
"created_at" : ISODate("2022-11-13T09:32:45.000+0000"),
"number" : "40000005",
"_id" : ObjectId("some_id")
}
]
}
Answer:
A regular expression query is costly at this scale (millions of documents), whether or not the field is indexed.
This is a similar question: "MongoDB, performance of query by regular expression on indexed fields".
I have not compared or tested the performance myself, but try using just the ^ anchor without the trailing .* (and without the i flag, since case-insensitive regular expressions generally cannot use an index efficiently):
{ "numbers.number": /^4/ }
As the additional note in the MongoDB documentation on regex index use says:
Additionally, while /^a/, /^a.*/, and /^a.*$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a.*/ and /^a.*$/ are slower. /^a/ can stop scanning after matching the prefix.
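To make the quoted point concrete, here is a small plain-JavaScript check (no database involved) that the three forms accept exactly the same strings. The sample values are made up for illustration. It says nothing about speed, which depends on how MongoDB walks the index.

```javascript
// Hypothetical sample values; not taken from the real collection.
const samples = ["40000005", "41234567", "50000000", "04000000", "4"];

// The three regex forms compared in the documentation quote above.
const patterns = [/^4/, /^4.*/, /^4.*$/];

// For each pattern, collect the samples it matches.
const matches = patterns.map((re) => samples.filter((s) => re.test(s)));

// Every pattern selects the same three strings: those starting with "4".
matches.forEach((m, i) => console.log(String(patterns[i]), m));
```

Since the results are identical, dropping the trailing .* changes only the work the server does, not which documents match.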
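Second, if you know the range of the numbers, you can use the $gte and $lt operators to select a specific series by its bounds. Because the numbers are stored as strings, the comparison is lexicographic, which agrees with numeric order here since every number has 8 digits: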
{
"numbers.number": {
"$gte": "40000000",
"$lt": "50000000"
}
}
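If you need the same kind of range for other leading digits, the bounds can be generated rather than typed by hand. The following is a minimal sketch in plain JavaScript; prefixRange is a hypothetical helper name, not a MongoDB function, and it assumes fixed-length digit strings and a leading digit from 0 to 8 (for 9 there is no single-digit upper bound, so a lone $gte would be needed).

```javascript
// Build a half-open string range [lower, upper) covering every digit string
// of length `totalLength` that starts with `firstDigit` (0-8 only).
// Hypothetical helper for illustration; not part of any MongoDB API.
function prefixRange(firstDigit, totalLength) {
  if (!/^[0-8]$/.test(String(firstDigit))) {
    throw new RangeError("firstDigit must be a single digit from 0 to 8");
  }
  const d = Number(firstDigit);
  const zeros = "0".repeat(totalLength - 1);
  return { $gte: `${d}${zeros}`, $lt: `${d + 1}${zeros}` };
}

const filter = { "numbers.number": prefixRange("4", 8) };
console.log(JSON.stringify(filter));
// {"numbers.number":{"$gte":"40000000","$lt":"50000000"}}
```

Since numbers is an array, it may also be worth trying the same bounds inside $elemMatch, i.e. { numbers: { $elemMatch: { number: { $gte: "40000000", $lt: "50000000" } } } }, which requires a single array element to satisfy both bounds. I have not measured whether it helps here.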
Third, you can check multiple ranges with the $or operator:
{
"$or": [
{
"numbers.number": {
"$gte": "4000000",
"$lt": "5000000"
}
},
{
"numbers.number": {
"$gte": "40000000",
"$lt": "50000000"
}
}
]
}
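The $or filter above can also be assembled programmatically, one range per number length. This is only a sketch: orRangesByLength, field, and lengths are names invented for the example, and it assumes a leading digit between 0 and 8.

```javascript
// Build an $or filter with one string range per number length, for values
// whose first digit is `firstDigit` (assumed to be 0-8).
// Hypothetical helper for illustration; not part of any MongoDB API.
function orRangesByLength(field, firstDigit, lengths) {
  const d = Number(firstDigit);
  return {
    $or: lengths.map((len) => {
      const zeros = "0".repeat(len - 1);
      return { [field]: { $gte: `${d}${zeros}`, $lt: `${d + 1}${zeros}` } };
    }),
  };
}

// Reproduces the 7-digit and 8-digit ranges from the query above.
const orFilter = orRangesByLength("numbers.number", 4, [7, 8]);
console.log(JSON.stringify(orFilter, null, 2));
```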
NOTE:
- Try running the query in the MongoDB shell.
- If you only need the number of matching documents, use a count function instead of fetching them:
db.coll.find({query}).count()
db.coll.countDocuments({query})
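Putting the range filter and the count together might look like the sketch below. countNumbersInRange and fakeColl are hypothetical names; the only thing assumed about a real collection is that it exposes countDocuments(filter, options), as collections do in mongosh and the Node.js driver. The stand-in collection exists only so the example runs without a database, and the 60-second maxTimeMS is an arbitrary choice.

```javascript
// Count documents whose "numbers.number" falls in [lower, upper).
// `coll` is assumed to expose countDocuments(filter, options).
function countNumbersInRange(coll, lower, upper) {
  const filter = { "numbers.number": { $gte: lower, $lt: upper } };
  // maxTimeMS asks the server to abort the operation after 60 seconds
  // instead of letting it run for many minutes.
  return coll.countDocuments(filter, { maxTimeMS: 60000 });
}

// Stand-in collection (made-up data) that mimics the count in memory.
const fakeColl = {
  docs: [
    { numbers: [{ number: "40000005" }] },
    { numbers: [{ number: "51234567" }] },
    { numbers: [{ number: "49999999" }, { number: "30000000" }] },
  ],
  countDocuments(filter) {
    const { $gte, $lt } = filter["numbers.number"];
    return this.docs.filter((doc) =>
      doc.numbers.some((n) => n.number >= $gte && n.number < $lt)
    ).length;
  },
};

console.log(countNumbersInRange(fakeColl, "40000000", "50000000")); // 2
```

In mongosh the call would be countNumbersInRange(db.coll, "40000000", "50000000"); with the Node.js driver countDocuments returns a Promise, so the result needs to be awaited.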