Home > Back-end >  Elasticsearch - How to down sample large datasets
Elasticsearch - How to down sample large datasets

Time:12-12

I'm trying to figure out how to query a large dataset so I could put it up on a js line chart. The index has millions of documents and I want to be able to show the entire series even if it's zoomed out.

The mapping kinda looks like this:

{
    "counter": {
        "type": long // used as kind of a sequential ID
    },
    "deposits": {
        "type": "nested",
        "properties": {
            "depositA": { "type": "long" },
            "depositB": { "type": "long" }
        }
    }
}

I want to show a line chart where the X axis is the counter values and the Y axis is the sum of the depositA and depositB values.

The dataset has about 7M docs so I'm thinking if I could get ES to return the average of every 7 rows,I could trim that down to 1M points for my chart and still have something that looks sensible. Possibly, even take it down to 100k points?

The problem is I don't really know where to start and I'm just very new to ES.

I tried histogram aggregations but it doesn't seem to be what I'm looking for.

POST /data/_search?size=0
{
  "aggs": {
    "counters": {
      "histogram": {
        "field": "counter",
        "interval": 50
      }
    }
  }
}

While this returns the counter field in 50 intervals, it also only gives me a doc count in those 50 counters (which I guess is just how histograms work?). I would like to know how to get the average value of depositA depositB across the 50 items along with the counter keys, if possible.

I'm really over my head here honestly but would love to learn. If anyone could point me to any helpful information that would be very much appreciated.

CodePudding user response:

In fact, histogram aggregation is the correct way to go, as I see. I think you need to put sub-aggregation on it. Here is an example for you:

POST indexa/_bulk
{"index": {"_id": "1"}}
{"counter": 1, "deposits": {"depositA": 10, "depositB": 15}}
{"index": {"_id": "2"}}
{"counter": 2, "deposits": {"depositA": 12, "depositB": 17}}
{"index": {"_id": "3"}}
{"counter": 3, "deposits": {"depositA": 16, "depositB": 16}}
{"index": {"_id": "4"}}
{"counter": 4, "deposits": {"depositA": 18, "depositB": 18}}



POST indexa/_search
{
  "size": 0,
  "aggs": {
    "range": {
      "histogram": {
        "field": "counter",
        "interval": 2
      },
      "aggs": {
        "nested": {
          "nested": {
            "path": "deposits"
          },
          "aggs": {
            "scripts": {
              "avg": {
                "script": {
                  "lang": "painless", 
                  "source": "return doc['deposits.depositA'].value   doc['deposits.depositB'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}

I think this will work for you. I put an avg aggregation with sub-aggregation. I used nested aggregation because your deposits field is nested typed.

  • Related