Elastic Search aggregation filter array for condition-CodePudding

My data looks like the following:

[
    {
        "name": "Scott",
        "origin": "London",
        "travel": [
            {
                "active": false,
                "city": "Berlin",
                "visited": "2020-02-01"
            },
            {
                "active": true,
                "city": "Prague",
                "visited": "2020-02-15"
            }
        ]
    },
    {
        "name": "Lilly",
        "origin": "London",
        "travel": [
            {
                "active": true,
                "city": "Scotland",
                "visited": "2020-02-01"
            }
        ]
    }
]

I want to perform an aggregation where each top-level origin is a bucket, then a nested aggregation to see how many people are currently visiting each city. I therefore only care what the city is if active is true.

Using a filter, it will search the visited array and return the complete array (both objects) if one has active set to true. I do not want to include cities where active is false.

Desired output:

{
  "aggregations": {
    "origin": {
      "buckets": [
        {
          "key": "London",
          "buckets": [
            {
              "key": "travel",
              "doc_count": 2555,
              "buckets": [
                {
                  "key": "Scotland",
                  "doc_count": 1
                },
                {
                  "key": "Prague",
                  "doc_count": 1
                }
              ]
            }
          ]
        }
      ]
    }
  }
}

Above I only have 2 counts of under the travel aggregation because only two travel objects have active set to true.

Currently, I have my aggregation set up like so:

{
  "from": 0,
  "aggs": {
    "origin": {
      "terms": {
        "field": "origin"
      },
      "aggs": {
        "travel": {
          "filter": {
            "term": {
              "travel.active": true
            }
          },
          "aggs": {
            "city": {
              "terms": {
                "field": "city"
              }
            }
          }
        }
      }
    }
  }
}

I have my top level aggregation on origin, then a nested agg on the travel array. Here I have a filter on travel.active = true, then another nested agg to create buckets for each city.

In my aggregation, it's still producing Berlin as a city, even though I'm filtering on active = true.

My guess is because it's allowing it as active: true is true for one of the objects in the array.

How do I filter out active: false completely from the aggregation?

CodePudding user response：

You will have to use "nested aggregation." Official documentation link for a reference

Here is an example for your query:

Mapping:

PUT /city_index
{
  "mappings": {
    "properties": {
      "name" : { "type" : "keyword" },
      "origin" : { "type" : "keyword" },
      "travel": { 
        "type": "nested",
        "properties": {
          "active": {
            "type": "boolean"
          },
          "city": {
            "type": "keyword"
          },
          "visited" : {
            "type":"date"
          }
        }
      }
    }
  }
}

Insert:

PUT /city_index/_doc/1
{
  "name": "Scott", 
  "origin" : "London",
  "travel": [
    {
      "active": false,
      "city": "Berlin",
      "visited" : "2020-02-01"
    },
    {
      "active": true,
      "city": "Prague",
      "visited": "2020-02-15"
    }
  ]
}

PUT /city_index/_doc/2
{
  "name": "Lilly",
  "origin": "London",
  "travel": [
    {
      "active": true,
      "city": "Scotland",
      "visited": "2020-02-01"
    }
  ]
}

Query:

GET /city_index/_search
{
  "size": 0,
  "aggs": {
    "origin": {
      "terms": {
        "field": "origin"
      },
      "aggs": {
        "city": {
          "nested": {
            "path": "travel"
          },
          "aggs": {
            "travel": {
              "filter": {
                "term": {
                  "travel.active": true
                }
              },
              "aggs": {
                "city": {
                  "terms": {
                    "field": "travel.city"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Output:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "origin": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "London",
          "doc_count": 2,
          "city": {
            "doc_count": 3,
            "travel": {
              "doc_count": 2,
              "city": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                  {
                    "key": "Prague",
                    "doc_count": 1
                  },
                  {
                    "key": "Scotland",
                    "doc_count": 1
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

CodePudding user response：

The suggestion @karthick is good but I add the filter in query. That way you will have a smaller amount of values in the aggregation stage.

GET idx_travel/_search
{
  "size": 0,
  "query": {
    "nested": {
      "path": "travel",
      "query": {
        "term": {
          "travel.active": {
            "value": true
          }
        }
      }
    }
  },
  "aggs": {
    "origin": {
      "terms": {
        "field": "origin"
      },
      "aggs": {
        "city": {
          "nested": {
            "path": "travel"
          },
          "aggs": {
            "city": {
              "terms": {
                "field": "travel.city"
              }
            }
          }
        }
      }
    }
  }
}