How to calculate the count of regex query results?

Time:05-12

I have a large MongoDB collection (approx. 30M records) and each item has an array of unique numbers consisting of 8 digits. Most of the time that array has only 1 element (1 number). I want to find out how many records the collection holds with numbers starting with 4, for example, so I query:

{ "numbers.number": /^4.*/i }

However, the query takes too long: last time I interrupted it after 20 minutes. Is there a way to optimize the query? numbers.number is indexed. I also tried this one:

{ "numbers.number": /^4[0-9]{7}/}

It still takes too long. Here's an example document:

{ 
    "_id" : ObjectId("some_id"), 
    "created_at" : ISODate("2022-10-13T09:32:45.000+0000"), 
    "source" : {
        "created_at" : ISODate("2021-10-13T08:54:06.000+0000"), 
        "some_id" : NumberInt(234), 
        "another_id" : NumberInt(11)
    }, 
    "first_name" : "Test", 
    "last_name" : "Test", 
    "date_of_birth" : "1970-01-01", 
    "status" : "active", 
    "numbers" : [
        {
            "created_at" : ISODate("2022-11-13T09:32:45.000+0000"), 
            "number" : "40000005", 
            "_id" : ObjectId("some_id")
        }
    ]
}

CodePudding user response:

Regular expression queries are expensive on a collection with millions of documents, whether or not the field is indexed.

A similar question: MongoDB, performance of query by regular expression on indexed fields

I have not compared or tested the performance myself, but try anchoring with just ^ and dropping the trailing .* (and the i flag, since case-insensitive regular expressions generally cannot use an index efficiently):

{ "numbers.number": /^4/ }

According to the additional note on index use in MongoDB's $regex documentation:

Additionally, while /^a/, /^a.*/, and /^a.*$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a.*/, and /^a.*$/ are slower. /^a/ can stop scanning after matching the prefix.
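To see that the three forms select the same values, here is a minimal JavaScript sketch that runs outside MongoDB. The sample values are made up, and the sketch says nothing about speed: the difference described in the note comes from how MongoDB scans the index, not from the match result.

```javascript
// Made-up sample values; only the first two start with "4".
const samples = ["40000005", "41234567", "50000000", "04000000"];

// The three equivalent prefix patterns from the note, using "4" instead of "a".
const patterns = [/^4/, /^4.*/, /^4.*$/];

// Collect the matching values for each pattern.
const matches = patterns.map((re) => samples.filter((s) => re.test(s)));

// Each pattern selects the same subset of the samples.
matches.forEach((m, i) => console.log(String(patterns[i]), m));
```

Since the results are identical, the shortest form, /^4/, is the one to prefer in the query.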


Second, if you know the range of the numbers, you can use the $gte and $lt operators to select a specific series:

{ 
  "numbers.number": {
    "$gte": "40000000",
    "$lt": "50000000"
  }
}
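The range filter can also be generated from a prefix. The following is only a sketch: prefixRange is a hypothetical helper, not a MongoDB API, and it assumes every stored number is a digit-only string of exactly 8 characters. It pads with 9s and uses $lte instead of the $lt form above so that a prefix such as "9" also works.

```javascript
// Hypothetical helper (not a MongoDB API): turn a digit prefix into a range
// filter, assuming every stored number is a digit-only string of `width` chars.
function prefixRange(prefix, width = 8) {
  if (!/^[0-9]+$/.test(prefix) || prefix.length > width) {
    throw new RangeError(`prefix must be 1 to ${width} digits`);
  }
  return {
    $gte: prefix.padEnd(width, "0"), // smallest value with this prefix
    $lte: prefix.padEnd(width, "9"), // largest value with this prefix
  };
}

// Filter document that could be passed to find() or countDocuments().
const filter = { "numbers.number": prefixRange("4") };
console.log(JSON.stringify(filter));

// Local sanity check on made-up values: for fixed-width digit strings,
// plain string comparison selects the same values as the /^4/ prefix regex.
const { $gte, $lte } = filter["numbers.number"];
const samples = ["39999999", "40000000", "40000005", "49999999", "50000000"];
const byRange = samples.filter((s) => s >= $gte && s <= $lte);
const byRegex = samples.filter((s) => /^4/.test(s));
console.log(byRange, byRegex);
```

The local check only compares strings in JavaScript; whether the range form is actually faster than the regex on the real collection still has to be measured.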

Third, you can check multiple ranges with the $or operator:

{ 
  "$or": [
    {
      "numbers.number": {
        "$gte": "4000000",
        "$lt": "5000000"
      }
    },
    {
      "numbers.number": {
        "$gte": "40000000",
        "$lt": "50000000"
      }
    }
  ]
}
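Because number is stored as a string, these bounds are compared as strings, character by character, and not as integers. The sketch below uses plain JavaScript string comparison as a stand-in. It assumes the collection uses MongoDB's default binary string comparison (no collation), which orders digit-only strings the same way. The sample values are made up.

```javascript
// Stand-in for one string range branch: gte <= value < lt, compared as strings.
const inRange = (value, gte, lt) => value >= gte && value < lt;

// An 8-digit value already falls inside the 7-digit bounds, because
// "4000000" is a prefix of "40000005" and therefore sorts before it.
console.log(inRange("40000005", "4000000", "5000000")); // true

// String ranges do not enforce length: "45" sorts between the 8-digit bounds.
console.log(inRange("45", "40000000", "50000000")); // true

// A numerically larger value is excluded, because "1" sorts before "4".
console.log(inRange("123456789", "40000000", "50000000")); // false
```

If the stored values can differ in length, a range on strings behaves like a prefix match and not like a numeric comparison, so it is worth checking the data before relying on it.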

NOTE:

  • Try running this query in the MongoDB shell.
  • If you only need the number of matching documents, use a count function:
    • db.coll.find({query}).count()
    • db.coll.countDocuments({query})
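As a rough sketch of the counting step, the prefix regex and countDocuments can be wrapped together. countByPrefix and fakeCollection are hypothetical names used only for illustration. The helper assumes a collection object that exposes countDocuments(filter), and it uses countDocuments because recent drivers deprecate the cursor count() form. The in-memory stand-in exists only so the sketch runs without a server.

```javascript
// Hypothetical helper: count documents whose numbers.number starts with `prefix`.
// `collection` is assumed to expose countDocuments(filter); `prefix` is assumed
// to contain digits only, so it needs no regex escaping.
async function countByPrefix(collection, prefix) {
  const filter = { "numbers.number": new RegExp("^" + prefix) };
  return collection.countDocuments(filter);
}

// In-memory stand-in for a collection, only so the sketch runs without a server.
const fakeCollection = {
  docs: [
    { numbers: [{ number: "40000005" }] },
    { numbers: [{ number: "51234567" }, { number: "41234567" }] },
    { numbers: [{ number: "70000000" }] },
  ],
  async countDocuments(filter) {
    const re = filter["numbers.number"];
    return this.docs.filter((d) => d.numbers.some((n) => re.test(n.number))).length;
  },
};

const pending = countByPrefix(fakeCollection, "4");
pending.then((n) => console.log(n)); // 2 with the made-up documents above
```

Against a real collection, the same helper would be called with the collection object in place of the stand-in.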