I am new to MongoDB and NoSQL and I am trying to run queries on a huge data set (around 50 million documents)
I am running the latest version of MongoDB using Docker on a Windows 10 host with 64 GB RAM.
I am using pymongo
to import the data and run queries and I also have Mongo Express
running as a docker container to view the imported data.
My statement to create my time-series collection is:
mydb.command('create', 'sensor_data', timeseries={
'timeField': 'collection_time',
'metaField': 'sensor'
})
Each document looks something like this:
{
"sensor": { "id": 1, "location":"Somewhere"},
"collection_time": datetime.strptime("2022/01/01 01:23:45 PM", '%Y/%m/%d %I:%M:%S %p'),
...
}
I have been able to use Mongo Express
to verify that the data has been loaded into MongoDB correctly.
I then tried to run the following bit of code:
res = mycol.find({
"collection_time": {
"$gte": datetime.strptime("2021/01/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p'),
"$lte": datetime.strptime("2022/02/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p')
}
})
But this query takes a very long time to run.
After running res.explain()
, I can see that the operation is doing a COLSCAN
and not using an index. I even tried manually creating an index on 'collection_time' but the query is still doing a COLSCAN.
What am I missing?
Update 1
Thanks to R2D2's solution, I have got this working for find()
, but I can't get it to work for aggregate()
Here is my code:
res = mycol.aggregate([
{
"$match": {
"collection_time": {
"$gte": datetime.strptime("2021/01/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p'),
"$lte": datetime.strptime("2022/02/01 12:00:00 AM", '%Y/%m/%d %I:%M:%S %p')
}
}
},
{"$group":
{
"_id": {
"year" : { "$year" : "$collection_time" },
"month" : { "$month" : "$collection_time" },
"day" : { "$dayOfMonth" : "$collection_time" },
},
"count":{ "$sum": 1}
}
}
], {hint: "collection_time_1" })
This gives the error: NameError: name 'hint' is not defined
Putting hint
in quotes gives the error: AttributeError: 'dict' object has no attribute '_txn_read_preference'
CodePudding user response:
By default when you create Time Series collection it is effective for storing time series data , but there is no indexes created , you can create secondary indexes for Time Series collection to improve performance for queries and if query planner do not select some of the created indexes you can add hint() to the query with the index name ( you can get the indexes names with db.collection.getIndexes() )
For mongodb 3.6 you can use hint also in aggregation framework as follow:
db.collection.aggregate(pipeline, {hint: "index_name"})