I want to know:
- sum of rain (mm)
- sum of sunshine (hours)
- probability (%) of a rainy day with more than 0.5mm of rain
for the weekends (Saturday and Sunday) of calendar weeks 20 to 40, over the last 17 years.
I have 820k documents stored at 10-minute intervals. The request sometimes takes 38 seconds, but sometimes more than a minute.
Do you have an idea how to improve the performance?
Data model:
[
'datum',
'regen',
'tempAussen',
'sonnenSchein',
and more...
]
schema:
[
{
$project: {
jahr: {
$year: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
woche: {
$week: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
day: {
$isoDayOfWeek: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
stunde: {
$hour: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
tagjahr: {
$dayOfYear: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
tempAussen: 1,
regen: 1,
sonnenSchein: 1,
},
},
{
$match: {
$and: [
{
woche: {
$gte: 20,
},
},
{
woche: {
$lte: 40,
},
},
{
day: {
$gte: 6,
},
},
],
},
},
{
$group: {
_id: ['$tagjahr', '$jahr'],
woche: {
$first: '$woche',
},
regen_sum: {
$sum: '$regen',
},
sonnenSchein_sum: {
$sum: '$sonnenSchein',
},
},
},
{
$project: {
_id: '$_id',
regenTage: {
$sum: {
$cond: {
if: {
$gte: ['$regen_sum', 0.5],
},
then: 1,
else: 0,
},
},
},
woche: 1,
regen_sum: 1,
sonnenSchein_sum: 1,
},
},
{
$group: {
_id: '$woche',
regen_sum: {
$sum: '$regen_sum',
},
sonnenSchein_sum: {
$sum: '$sonnenSchein_sum',
},
regenTage: {
$sum: '$regenTage',
},
},
},
{
$project: {
regenTage: 1,
regen_sum: {
$divide: ['$regen_sum', 34],
},
sonnenSchein_sum: {
$divide: ['$sonnenSchein_sum', 2040],
},
probability: {
$divide: ['$regenTage', 0.34],
},
},
},
{
$project: {
regen_sum: {
$round: ['$regen_sum', 1],
},
sonnenSchein_sum: {
$round: ['$sonnenSchein_sum', 1],
},
wahrscheinlich: {
$round: ['$probability', 0],
},
},
},
{
$sort: {
_id: 1,
},
},
]
This result is an example for week 20: on the weekend of calendar week 20 I have, on average, 2.3mm of rain, 11.9h of sunshine, and a probability of 35% that it will rain on at least one day of the weekend.
_id:20
regen_sum:2.3
sonnenSchein_sum:11.9
probability:35
CodePudding user response:
Without having the verbose explain output (.explain("allPlansExecution")), it is hard to say anything for sure. Here are some observations from just taking a look at the aggregation pipeline that was provided (underneath "schema:").
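For reference, a minimal sketch of how that output could be gathered in mongosh (the collection name weather is a placeholder for whatever your collection is actually called):

// Sketch only: produce the detailed query plan for the aggregation shown above.
const pipeline = [ /* the stages listed under "schema:" */ ];
db.weather.explain("allPlansExecution").aggregate(pipeline);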
Before going into observations, I must ask what your specific goals are. Are operations like these something you will be running frequently? Is anything faster than 38 seconds acceptable, or is there a specific runtime that you are looking for? As outlined below, there probably isn't much opportunity for direct improvement. Therefore it might be beneficial to look into other approaches to the problem, and I'll outline one at the end.
The first observation is that this aggregation is going to perform a full collection scan. Even if an index existed on the datum field, it could not be used, since the filtering in the $match is done on new fields that are calculated from datum. We could make some changes to allow an index to be used, but it probably wouldn't help. You are processing ~38% of your data (20 of the 52 weeks per year), so the overhead of doing the index scan and randomly fetching a significant portion of the data is probably more than just scanning the entire collection directly.
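To illustrate the kind of change that would be needed (purely a sketch, assuming an index on { datum: 1 } existed; the years and date boundaries below are placeholders and would have to be computed properly per year), the filter would have to be expressed on the raw datum field before any computed fields are derived:

// Sketch only: build one datum range per year (placeholder dates roughly
// covering calendar weeks 20-40, placeholder years for "the last 17 years")
// so that an index on { datum: 1 } could be considered by the planner.
const ranges = [];
for (let jahr = 2005; jahr <= 2021; jahr++) {
  ranges.push({ datum: { $gte: new Date(`${jahr}-05-14`), $lt: new Date(`${jahr}-10-10`) } });
}
db.weather.aggregate([
  { $match: { $or: ranges } },
  // ...the existing $project, $match, $group, ... stages would follow here
]);

As noted above, though, this still touches a large fraction of the data, so it is unlikely to help much on its own.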
Secondly, you are currently $grouping twice. The only reason for this seems to be so that you can first determine whether a day counts as 'rainy' (more than 0.5mm of rain). But the 'rainy day' indicator then effectively gets combined into a 'rainy weekend' indicator in the second grouping. Although it could technically change the results a little, due to the rounding being done on a 24-hour basis, perhaps that small change would be worthwhile in order to eliminate one of the $group stages entirely?
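A smaller, related simplification (not the same as removing a whole $group, but a sketch of trimming the pipeline in the same spirit) would be to fold the rainy-day flag into the second $group by putting the $cond directly inside the accumulator, dropping the intermediate $project:

// Sketch only: compute regenTage inside the second $group instead of in a
// separate $project stage beforehand; $regen_sum here is the per-day sum
// coming out of the first $group.
{
  $group: {
    _id: '$woche',
    regen_sum: { $sum: '$regen_sum' },
    sonnenSchein_sum: { $sum: '$sonnenSchein_sum' },
    regenTage: { $sum: { $cond: [{ $gte: ['$regen_sum', 0.5] }, 1, 0] } },
  },
},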
If this were my system, I would consider pre-aggregating some of this data. Specifically, having daily summaries, as opposed to the 10-minute intervals of raw data, would go a long way here in reducing the amount of processing required to generate summaries like this. The details for each day (which won't change) would then be contained in a single document rather than in 144 individual ones. That would certainly allow an aggregation logically equivalent to the one above to process much faster than what you are currently observing.
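A minimal sketch of how such daily summaries could be maintained (assuming MongoDB 5.0+ for $dateTrunc, and placeholder collection names weather for the raw readings and weather_daily for the summaries):

// Sketch only: roll the 10-minute readings up into one document per
// calendar day and write them to a summary collection. Re-run periodically
// (e.g. once per day) to pick up new data.
db.weather.aggregate([
  {
    $group: {
      // one bucket per calendar day in the local timezone
      _id: { $dateTrunc: { date: '$datum', unit: 'day', timezone: 'Europe/Berlin' } },
      regen_sum: { $sum: '$regen' },
      sonnenSchein_sum: { $sum: '$sonnenSchein' },
    },
  },
  { $merge: { into: 'weather_daily', whenMatched: 'replace', whenNotMatched: 'insert' } },
]);

The weekly summary aggregation would then run against weather_daily (on the order of 6,200 documents for 17 years) instead of the 820k raw documents.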