Elasticsearch - How to create buckets by using information from two fields at the same time?-CodePudding

My documents are like this:

{'start': 0, 'stop': 3, 'val': 3}
{'start': 2, 'stop': 4, 'val': 1}
{'start': 5, 'stop': 6, 'val': 4}

We can imagine that each document occupies the x-coordinates from 'start' to 'stop', and has a certain value 'val' ('start' < 'stop' is guaranteed).

The goal is to plot a line showing the sum of these values 'val' from all the documents which occupy an x-coordinate:

^{this graph online}

In reality there are many documents with many different 'start' and 'stop' coordinates. Speed is important, so:

Is this possible to do with at most a couple of elastic search requests? how?

What I've tried:

With one elastic search request we can get the min_start, and max_stop coordinates. These will be the boundaries of x.

Then we divide the x-coordinates into N intervals, and in a loop for each interval we make an elastic search request: to filter out all the documents which lie completely outside of this interval, and do a sum aggregation of 'val'.

This approach takes too much time because there are N 1 requests, and if we want to have a line with higher precision, the time will increase linearly.

Code:

N = 300  # number of intervals along x

x = []
y = []

data = es.search(index='index_name',
                 body={
                    'aggs': {
                        'min_start': {'min': {'field': 'start'}},
                        'max_stop': {'max': {'field': 'stop'}}
                    }
                 })
min_x = data['aggregations']['min_start']['value']
max_x = data['aggregations']['max_stop']['value']

x_from = min_x
x_step = (max_x - min_x) / N
for _ in range(N):
    x_to = x_from   x_step
    
    data = es.search(
        index='index_name', 
        body= {
            'size': 0,  # to not return any actual documents

            'query': {
                'bool': {
                    'should': [
                        # start is in the current x-interval:
                        {'bool': {'must': [
                            {'range': {'start': {'gte': x_from}}},
                            {'range': {'start': {'lte': x_to}}}
                        ]}},

                        # stop is in the current x-interval:
                        {'bool': {'must': [
                            {'range': {'stop': {'gte': x_from}}},
                            {'range': {'stop': {'lte': x_to}}}
                        ]}},

                        # current x-interval is inside start--stop
                        {'bool': {'must': [
                            {'range': {'start': {'lte': x_from}}},
                            {'range': {'stop': {'gte': x_to}}}
                        ]}}
                    ],
                    'minimum_should_match': 1  # at least 1 of these 3 conditions should match
                }
            },

            'aggs': {
                'vals_sum': {'sum': {'field': 'val'}}
            }
        }
    )
    
    # Append info to the lists:
    x.append(x_from)    
    y.append(data['aggregations']['vals_sum']['value'])
    
    # Next x-interval:
    x_from = x_to


from matplotlib import pyplot as plt
plt.plot(x, y)

CodePudding user response：

The right way to do this in one single query is to use the range field type (available since 5.2) instead of using two fields start and stop and reimplementing the same logic. Like this:

PUT test 
{
  "mappings": {
    "properties": {
      "range": {
        "type": "integer_range"
      },
      "val": {
        "type":"integer"
      }
    }
  }
}

Your documents would look like this:

    {
      "range" : {
        "gte" : 0,
        "lt" : 3
      },
      "val" : 3
    }

And then the query would simply leverage an histogram aggregation like this:

POST test/_search 
{
  "size": 0,
  "aggs": {
    "histo": {
      "histogram": {
        "field": "range",
        "interval": 1
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "val"
          }
        }
      }
    }
  }
}

And the results are as expected: 3, 3, 4, 1, 0, 4