Perform a pipelines aggregation over the full set of potential buckets


When using the _search API of Elasticsearch, if you set size to 10 and perform an avg metric aggregation, the average is computed over all documents matching the query, not just over the 10 items returned in the hits array.

On the other hand, if you perform a terms aggregation with its size set to 10, then an avg_bucket pipeline aggregation on those terms buckets will calculate an average over only those 10 buckets - not over all potential buckets.

How can I calculate the average of some field across all potential buckets, while still returning only 10 items in the buckets array?


To make my question more concrete, consider this example: Suppose that I am a hat maker. Multiple stores carry my hats. I have an Elasticsearch index hat-sales which has one document for each time one of my hats is sold. Included in this document are the price and the store at which the hat was sold.

Here are two examples of the documents I tested this on:

{
  "type": "top",
  "color": "black",
  "price": 19,
  "store": "Macy's"
}
{
  "type": "fez",
  "color": "red",
  "price": 94,
  "store": "Walmart"
}

If I want to find the average price of all the hats I have sold, I can run this:

GET hat-sales/_search
{
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "average_hat_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}

And average_hat_price will be the same whether size is set to 0, 3, or whatever.

OK, now I want to find the top 3 stores which have sold the most hats. I also want to compare them with the average number of hats sold at a store. So I want to do something like this:

GET hat-sales/_search
{
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "by_store": {
      "terms": {
        "field": "store.keyword",
        "size": 3
      },
      "aggs": {
        "sales_count": {
          "cardinality": {
            "field": "_id"
          }
        }
      }
    },
    "avg sales at a store": {
      "avg_bucket": {
        "buckets_path": "by_store>sales_count"
      }
    }
  }
}

which yields a response of

"aggregations" : {
    "by_store" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 8,
      "buckets" : [
        {
          "key" : "Macy's",
          "doc_count" : 6,
          "sales_count" : {
            "value" : 6
          }
        },
        {
          "key" : "Walmart",
          "doc_count" : 5,
          "sales_count" : {
            "value" : 5
          }
        },
        {
          "key" : "Dillard's",
          "doc_count" : 3,
          "sales_count" : {
            "value" : 3
          }
        }
      ]
    },
    "avg sales at a store" : {
      "value" : 4.666666666666667
    }
  }

The problem is that avg sales at a store is calculated over only Macy's, Walmart, and Dillard's. If I want to find the average over all stores, I have to set aggs.by_store.terms.size to 65536. (65536 because that is the default maximum number of terms buckets, and I do not know a priori how many buckets there may be.)
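Only the terms size changes; everything else stays the same as the query above:

GET hat-sales/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "by_store": {
      "terms": {
        "field": "store.keyword",
        "size": 65536
      },
      "aggs": {
        "sales_count": {
          "cardinality": {
            "field": "_id"
          }
        }
      }
    },
    "avg sales at a store": {
      "avg_bucket": {
        "buckets_path": "by_store>sales_count"
      }
    }
  }
}

This gives a result of: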

"aggregations" : {
    "by_store" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Macy's",
          "doc_count" : 6,
          "sales_count" : {
            "value" : 6
          }
        },
        {
          "key" : "Walmart",
          "doc_count" : 5,
          "sales_count" : {
            "value" : 5
          }
        },
        {
          "key" : "Dillard's",
          "doc_count" : 3,
          "sales_count" : {
            "value" : 3
          }
        },
        {
          "key" : "Target",
          "doc_count" : 3,
          "sales_count" : {
            "value" : 3
          }
        },
        {
          "key" : "Harrod's",
          "doc_count" : 2,
          "sales_count" : {
            "value" : 2
          }
        },
        {
          "key" : "Men's Warehouse",
          "doc_count" : 2,
          "sales_count" : {
            "value" : 2
          }
        },
        {
          "key" : "Sears",
          "doc_count" : 1,
          "sales_count" : {
            "value" : 1
          }
        }
      ]
    },
    "avg sales at a store" : {
      "value" : 3.142857142857143
    }
  }

So the average number of hats sold per store is 3.1, not 4.6. But in the buckets array I want to see only the top 3 stores.

Answer:

You can achieve what you are aiming at without a pipeline aggregation. It sort of cheats the aggregation framework, but it works.

Here is the data setup:

PUT hat_sales
{
  "mappings": {
    "properties": {
      "storename": {
        "type": "keyword"
      }
    }
  }
}

POST hat_sales/_bulk?refresh=true
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "bar"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}

Here is the tricky query:

GET hat_sales/_search?size=0
{
  "aggs": {
    "stores": {
      "terms": {
        "field": "storename",
        "size": 2
      }
    },
    "average_sales_count": {
      "avg_bucket": {
        "buckets_path": "stores>_count"
      }
    },
    "cheat": {
      "filters": {
        "filters": {
          "all": {
            "exists": {
              "field": "storename"
            }
          }
        }
      },
      "aggs": {
        "count": {
          "value_count": {
            "field": "storename"
          }
        },
        "unique_count": {
          "cardinality": {
            "field": "storename"
          }
        },
        "total_average": {
          "bucket_script": {
            "buckets_path": {
              "total": "count",
              "unique": "unique_count"
            },
            "script": "params.total / params.unique"
          }
        }
      }
    }
  }
}

This is a small abuse of the aggs framework. But the idea is that you effectively want num_docs/num_stores. I restricted num_docs to only the docs that actually have the storename field.

I got around some validations by using the filters agg, which is technically a multi-bucket agg (though I only care about one bucket here).

Then I get the unique count through cardinality (the number of stores) and the total count through value_count, and use a bucket_script to divide them: here, 6 total docs across 3 unique stores gives 6 / 3 = 2.0.

All in all, here is the slightly mangled result :D

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "cheat" : {
      "buckets" : {
        "all" : {
          "doc_count" : 6,
          "count" : {
            "value" : 6
          },
          "unique_count" : {
            "value" : 3
          },
          "total_average" : {
            "value" : 2.0
          }
        }
      }
    },
    "stores" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 1,
      "buckets" : [
        {
          "key" : "baz",
          "doc_count" : 3
        },
        {
          "key" : "foo",
          "doc_count" : 2
        }
      ]
    },
    "average_sales_count" : {
      "value" : 2.5
    }
  }
}

Note that cheat.buckets.all.total_average is 2.0 (the true average), while the old way (the avg_bucket pipeline) gives the non-global average of 2.5, computed over only the two buckets the terms aggregation returned.
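Translating this back to the hat-sales index from the question (a sketch; the global_average and avg_sales_per_store names are illustrative, and it assumes every sale document has a store value), the top 3 stores and the global average can come back in a single request:

GET hat-sales/_search
{
  "size": 0,
  "aggs": {
    "by_store": {
      "terms": {
        "field": "store.keyword",
        "size": 3
      }
    },
    "global_average": {
      "filters": {
        "filters": {
          "all": {
            "exists": {
              "field": "store.keyword"
            }
          }
        }
      },
      "aggs": {
        "count": {
          "value_count": {
            "field": "store.keyword"
          }
        },
        "unique_count": {
          "cardinality": {
            "field": "store.keyword"
          }
        },
        "avg_sales_per_store": {
          "bucket_script": {
            "buckets_path": {
              "total": "count",
              "unique": "unique_count"
            },
            "script": "params.total / params.unique"
          }
        }
      }
    }
  }
}

With the data from the question (22 sales across 7 stores), avg_sales_per_store should come out to 22 / 7 ≈ 3.14, matching the pipeline result above, while by_store still lists only the top 3 stores.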
