My documents are like this:
{'start': 0, 'stop': 3, 'val': 3}
{'start': 2, 'stop': 4, 'val': 1}
{'start': 5, 'stop': 6, 'val': 4}
We can imagine that each document occupies the x-coordinates from 'start'
to 'stop'
,
and has a certain value 'val'
('start' < 'stop'
is guaranteed).
The goal is to plot a line showing the sum of these values 'val'
from all the
documents which occupy an x-coordinate:
In reality there are many documents with many different 'start'
and 'stop'
coordinates. Speed is important, so:
Is this possible to do with at most a couple of elastic search requests? how?
What I've tried:
With one elastic search request we can get the min_start
, and max_stop
coordinates. These will be the boundaries of x.
Then we divide the x-coordinates into N
intervals, and in a loop for each interval we make an elastic search request: to filter out all the documents which lie completely outside of this interval, and do a sum aggregation of 'val'
.
This approach takes too much time because there are N 1
requests, and if we want to have a line with higher precision, the time will increase linearly.
Code:
N = 300 # number of intervals along x
x = []
y = []
data = es.search(index='index_name',
body={
'aggs': {
'min_start': {'min': {'field': 'start'}},
'max_stop': {'max': {'field': 'stop'}}
}
})
min_x = data['aggregations']['min_start']['value']
max_x = data['aggregations']['max_stop']['value']
x_from = min_x
x_step = (max_x - min_x) / N
for _ in range(N):
x_to = x_from x_step
data = es.search(
index='index_name',
body= {
'size': 0, # to not return any actual documents
'query': {
'bool': {
'should': [
# start is in the current x-interval:
{'bool': {'must': [
{'range': {'start': {'gte': x_from}}},
{'range': {'start': {'lte': x_to}}}
]}},
# stop is in the current x-interval:
{'bool': {'must': [
{'range': {'stop': {'gte': x_from}}},
{'range': {'stop': {'lte': x_to}}}
]}},
# current x-interval is inside start--stop
{'bool': {'must': [
{'range': {'start': {'lte': x_from}}},
{'range': {'stop': {'gte': x_to}}}
]}}
],
'minimum_should_match': 1 # at least 1 of these 3 conditions should match
}
},
'aggs': {
'vals_sum': {'sum': {'field': 'val'}}
}
}
)
# Append info to the lists:
x.append(x_from)
y.append(data['aggregations']['vals_sum']['value'])
# Next x-interval:
x_from = x_to
from matplotlib import pyplot as plt
plt.plot(x, y)
CodePudding user response:
The right way to do this in one single query is to use the range
field type (available since 5.2) instead of using two fields start
and stop
and reimplementing the same logic. Like this:
PUT test
{
"mappings": {
"properties": {
"range": {
"type": "integer_range"
},
"val": {
"type":"integer"
}
}
}
}
Your documents would look like this:
{
"range" : {
"gte" : 0,
"lt" : 3
},
"val" : 3
}
And then the query would simply leverage an histogram
aggregation like this:
POST test/_search
{
"size": 0,
"aggs": {
"histo": {
"histogram": {
"field": "range",
"interval": 1
},
"aggs": {
"total": {
"sum": {
"field": "val"
}
}
}
}
}
}
And the results are as expected: 3, 3, 4, 1, 0, 4