Home > Software engineering >  ElasticSearch - merge/combine the results by group
ElasticSearch - merge/combine the results by group

Time:02-16

Following is the data stored in ElasticSearch:

[
  {
    "id": 1,
    "class": "class 1",
    "name": "Scott",
    "scores": [
      { "year": 2022, "score": 100 },
      { "year": 2011, "score": 80 },
      { "year": 2003, "score": 70 }
    ]
  },
  {
    "id": 2,
    "class": "class 1",
    "name": "Gabriel",
    "scores": [
      { "year": 2015, "score": 90 },
      { "year": 2011, "score": 70 }
    ]
  },
  {
    "id": 3,
    "class": "class 2",
    "name": "Scott",
    "scores": [
      { "year": 2022, "score": 100 },
      { "year": 2021, "score": 100 },
      { "year": 2003, "score": 80 }
    ]
  },
  {
    "id": 4,
    "class": "class 2",
    "name": "Pierce",
    "scores": [
      { "year": 2022, "score": 70 }
    ]
  }
]

Is there a way in ElasticSearch to merge/combine scores into one array by specific group? (keep the duplicates values)

For example:

  1. Group by class, it will show the scores of class 1 and class 2, and just keep the class and scores fields, the result would be:
[
  {
    "class": "class 1",
    "scores": [
    { "year": 2022, "score": 100 },
    { "year": 2015, "score": 90 },
    { "year": 2011, "score": 80 },
    { "year": 2011, "score": 70 },
    { "year": 2003, "score": 70 }
    ]
  },
  {
    "class": "class 2",
    "scores": [
    { "year": 2022, "score": 100 },
    { "year": 2022, "score": 70 },
    { "year": 2021, "score": 100 },
    { "year": 2003, "score": 80 }
    ]
  }
]
  1. Group by name, it will put all the scores of Scott into one array, and keep only name and scores fields:
[
  {
    "name": "Scott",
    "scores": [
      { "year": 2022, "score": 100 },
      { "year": 2022, "score": 100 },
      { "year": 2021, "score": 100 },
      { "year": 2011, "score": 80 },
      { "year": 2003, "score": 80 },
      { "year": 2003, "score": 70 }
    ]
  },
  {
    "name": "Gabriel",
    "scores": [
      { "year": 2015, "score": 90 },
      { "year": 2011, "score": 70 }
    ]
  },
  {
    "name": "Pierce",
    "scores": [
      { "year": 2022, "score": 70 }
    ]
  }
]

Thanks!

CodePudding user response:

Disclaimer: Tighten your seatbelt it is going to be verbose ^^

TLDR;

Yes, it is possible, using term aggregation.

Such as, grouping by class:

You will find a bucket called byClass of type term. Elastic will create bucket of documents per value in the field class.

-> class 1 and class 2

But as you will notice it create more aggregation in this aggregation.

-> nestedAGG, byyear and bynotes

The first one is a specificity of Elastic.

The 2 other further break down the bucket by year then notes respectively.

GET /71128503/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "byClass": {
      "terms": {
        "field": "class",
        "size": 10
      },
      "aggs": {
        "nestedAGG": {
          "nested": {
            "path": "scores"
          },
          "aggs": {
            "byyear": {
              "terms": {
                "field": "scores.year",
                "size": 10
              },
              "aggs": {
                "bynotes": {
                  "terms": {
                    "field": "scores.score",
                    "size": 10
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
{
  ...

  "aggregations" : {
    "byClass" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "class 1",
          "doc_count" : 2,
          "nestedAGG" : {
            "doc_count" : 5,
            "byyear" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : 2011,
                  "doc_count" : 2,
                  "bynotes" : {
                    "doc_count_error_upper_bound" : 0,
                    "sum_other_doc_count" : 0,
                    "buckets" : [
                      {
                        "key" : 70,
                        "doc_count" : 1
                      },
                      {
                        "key" : 80,
                        "doc_count" : 1
                      }
                    ]
                  }
                },
                {
                  "key" : 2003,
                  "doc_count" : 1,
                  "bynotes" : {
                    "doc_count_error_upper_bound" : 0,
                    "sum_other_doc_count" : 0,
                    "buckets" : [
                      {
                        "key" : 70,
                        "doc_count" : 1
                      }
                    ]
                  }
                },
                {
                  "key" : 2015,
                  "doc_count" : 1,
                  "bynotes" : {
                    "doc_count_error_upper_bound" : 0,
                    "sum_other_doc_count" : 0,
                    "buckets" : [
                      {
                        "key" : 90,
                        "doc_count" : 1
                      }
                    ]
                  }
                },
                {
                  "key" : 2022,
                  "doc_count" : 1,
                  "bynotes" : {
                    "doc_count_error_upper_bound" : 0,
                    "sum_other_doc_count" : 0,
                    "buckets" : [
                      {
                        "key" : 100,
                        "doc_count" : 1
                      }
                    ]
                  }
                }
              ]
            }
          }
        },
        {
          "key" : "class 2",
          "doc_count" : 2,
          "nestedAGG" : {
            "doc_count" : 4,
            "byyear" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : 2022,
                  "doc_count" : 2,
                  "bynotes" : {
                    "doc_count_error_upper_bound" : 0,
                    "sum_other_doc_count" : 0,
                    "buckets" : [
                      {
                        "key" : 70,
                        "doc_count" : 1
                      },
                      {
                        "key" : 100,
                        "doc_count" : 1
                      }
                    ]
                  }
                },
                {
                  "key" : 2003,
                  "doc_count" : 1,
                  "bynotes" : {
                    "doc_count_error_upper_bound" : 0,
                    "sum_other_doc_count" : 0,
                    "buckets" : [
                      {
                        "key" : 80,
                        "doc_count" : 1
                      }
                    ]
                  }
                },
                {
                  "key" : 2021,
                  "doc_count" : 1,
                  "bynotes" : {
                    "doc_count_error_upper_bound" : 0,
                    "sum_other_doc_count" : 0,
                    "buckets" : [
                      {
                        "key" : 100,
                        "doc_count" : 1
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

To reproduce

To ingest the data. Notice the custom mapping:

  • keyword (grouping by term should not be done on text type fields)
  • Nested (Elastic by default flatten objects)
PUT /71128503/
{
  "settings": {},
  "mappings": {
    "properties": {
      "class": {
        "type": "keyword"
      },
      "name":{
        "type": "keyword"
      },
      "scores":{
        "type": "nested",
        "properties": {
          "score": {
            "type": "integer"
          },
          "year": {
            "type": "integer"
          }
        }
      }
    }
  }
}


POST /_bulk
{"index":{"_index":"71128503","_id":1}}
{"class":"class 1","name":"Scott","scores":[{"year":2022,"score":100},{"year":2011,"score":80},{"year":2003,"score":70}]}
{"index":{"_index":"71128503","_id":2}}
{"class":"class 1","name":"Gabriel","scores":[{"year":2015,"score":90},{"year":2011,"score":70}]}
{"index":{"_index":"71128503","_id":3}}
{"class":"class 2","name":"Scott","scores":[{"year":2022,"score":100},{"year":2021,"score":100},{"year":2003,"score":80}]}
{"index":{"_index":"71128503","_id":4}}
{"class":"class 2","name":"Pierce","scores":[{"year":2022,"score":70}]}

Then to query the data:

By class / By name

GET /71128503/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "byName": {           <- Name of your bucket
      "terms": {          <- Type of grouping, Elastic support many, like sum, avg on numeric value ....
        "field": "name",  <- Field you grouping on
        "size": 10
      },
      "aggs": {
        "nestedAGG": {
          "nested": {
            "path": "scores"
          },
          "aggs": {
            "byyear": {
              "terms": {
                "field": "scores.year",
                "size": 10
              },
              "aggs": {
                "bynotes": {
                  "terms": {
                    "field": "scores.score",
                    "size": 10
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
  • Related