When using the _search API of Elasticsearch, if you set size to 10 and perform an avg metric aggregation, the average will be computed over all values across the dataset matching the query, not just over the 10 items returned in the hits array.
On the other hand, if you perform a terms aggregation and set the size of the terms aggregation to 10, then performing an avg_bucket aggregation on those terms buckets will calculate an average over only those 10 buckets, not all potential buckets.
How can I calculate an average of some field across all potential buckets, but still have only 10 items in the buckets array?
To make my question more concrete, consider this example: Suppose that I am a hat maker. Multiple stores carry my hats. I have an Elasticsearch index hat-sales which has one document for each time one of my hats is sold. Included in this document are the price and the store at which the hat was sold.
Here are two examples of the documents I tested this on:
{
"type": "top",
"color": "black",
"price": 19,
"store": "Macy's"
}
{
"type": "fez",
"color": "red",
"price": 94,
"store": "Walmart"
}
If I want to find the average price of all the hats I have sold, I can run this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_hat_price": {
"avg": {
"field": "price"
}
}
}
}
And average_hat_price will be the same whether size is set to 0, 3, or whatever.
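For example, with size set to 0 the hits array comes back empty but the aggregation is still computed over every matching document, so the response looks like this (the average value here is illustrative):
{
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "average_hat_price" : {
      "value" : 41.3
    }
  }
}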
OK, now I want to find the top 3 stores which have sold the most hats. I also want to compare them with the average number of hats sold at a store. So I want to do something like this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"by_store": {
"terms": {
"field": "store.keyword",
"size": 3
},
"aggs": {
"sales_count": {
"cardinality": {
"field": "_id"
}
}
}
},
"avg sales at a store": {
"avg_bucket": {
"buckets_path": "by_store>sales_count"
}
}
}
}
which yields a response of
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
}
]
},
"avg sales at a store" : {
"value" : 4.666666666666667
}
}
The problem is that avg sales at a store is calculated over only Macy's, Walmart, and Dillard's. If I want to find the average over all stores, I have to set aggs.by_store.terms.size to 65536. (65536 because that is the default maximum number of terms buckets, and I do not know a priori how many buckets there may be.)
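That is, the same query as above with only the terms size changed:
GET hat-sales/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "by_store": {
      "terms": {
        "field": "store.keyword",
        "size": 65536
      },
      "aggs": {
        "sales_count": {
          "cardinality": {
            "field": "_id"
          }
        }
      }
    },
    "avg sales at a store": {
      "avg_bucket": {
        "buckets_path": "by_store>sales_count"
      }
    }
  }
}
This gives a result of: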
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Target",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Harrod's",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Men's Warehouse",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Sears",
"doc_count" : 1,
"sales_count" : {
"value" : 1
}
}
]
},
"avg sales at a store" : {
"value" : 3.142857142857143
}
}
So the average number of hats sold per store is 3.1, not 4.6. But in the buckets array I want to see only the top 3 stores.
CodePudding user response:
You can achieve what you are aiming at without feeding every bucket through a pipeline aggregation. It sort of cheats the aggregation framework, but it works.
Here is the data setup:
PUT hat_sales
{
"mappings": {
"properties": {
"storename": {
"type": "keyword"
}
}
}
}
POST hat_sales/_bulk?refresh=true
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "bar"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
Here is the tricky query:
GET hat_sales/_search?size=0
{
"aggs": {
"stores": {
"terms": {
"field": "storename",
"size": 2
}
},
"average_sales_count": {
"avg_bucket": {
"buckets_path": "stores>_count"
}
},
"cheat": {
"filters": {
"filters": {
"all": {
"exists": {
"field": "storename"
}
}
}
},
"aggs": {
"count": {
"value_count": {
"field": "storename"
}
},
"unique_count": {
"cardinality": {
"field": "storename"
}
},
"total_average": {
"bucket_script": {
"buckets_path": {
"total": "count",
"unique": "unique_count"
},
"script": "params.total / params.unique"
}
}
}
}
}
}
This is a small abuse of the aggs framework. But the idea is that you effectively want num_docs/num_stores: the total number of sales divided by the number of distinct stores. I restricted the count to only docs that actually have the storename field. For this test data that is 6 documents across 3 unique stores, so the true average is 6 / 3 = 2.0.
I got around some validations by using the filters agg, which is technically a multi-bucket agg (though I only care about the one bucket). Then I get the unique count through cardinality (the number of stores) and the total count through value_count, and use a bucket_script to finish it off.
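The validation being dodged here: bucket_script is a parent pipeline aggregation, so Elasticsearch requires it to be declared inside a multi-bucket aggregation. A variant like this, with bucket_script at the top level, is rejected (a sketch of the invalid query, for contrast):
GET hat_sales/_search?size=0
{
  "aggs": {
    "count": {
      "value_count": {
        "field": "storename"
      }
    },
    "unique_count": {
      "cardinality": {
        "field": "storename"
      }
    },
    "total_average": {
      "bucket_script": {
        "buckets_path": {
          "total": "count",
          "unique": "unique_count"
        },
        "script": "params.total / params.unique"
      }
    }
  }
}
Hence the single-filter filters agg: it counts as a multi-bucket agg as far as that validation is concerned, but it produces just the one all bucket.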
All in all, here is the slightly mangled result :D
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"cheat" : {
"buckets" : {
"all" : {
"doc_count" : 6,
"count" : {
"value" : 6
},
"unique_count" : {
"value" : 3
},
"total_average" : {
"value" : 2.0
}
}
}
},
"stores" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : "baz",
"doc_count" : 3
},
{
"key" : "foo",
"doc_count" : 2
}
]
},
"average_sales_count" : {
"value" : 2.5
}
}
}
Note that cheat.buckets.all.total_average is 2.0 (the true average over all stores), while the old way (the avg_bucket pipeline over the top-2 terms buckets) gives the non-global average of 2.5.
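Applied back to the hat-sales index from the question, the same trick would look something like this (a sketch, assuming the store.keyword field and data from the question; avg_sales_per_store is my name, not from the original):
GET hat-sales/_search
{
  "size": 0,
  "aggs": {
    "by_store": {
      "terms": {
        "field": "store.keyword",
        "size": 3
      }
    },
    "cheat": {
      "filters": {
        "filters": {
          "all": {
            "exists": {
              "field": "store"
            }
          }
        }
      },
      "aggs": {
        "count": {
          "value_count": {
            "field": "store.keyword"
          }
        },
        "unique_count": {
          "cardinality": {
            "field": "store.keyword"
          }
        },
        "avg_sales_per_store": {
          "bucket_script": {
            "buckets_path": {
              "total": "count",
              "unique": "unique_count"
            },
            "script": "params.total / params.unique"
          }
        }
      }
    }
  }
}
The buckets array keeps only the top 3 stores, while avg_sales_per_store works out to 22 / 7 ≈ 3.14, matching the global average from the size-65536 run in the question.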