Home > Software engineering >  How to group documents in Elasticsearch and get the documents in each group?
How to group documents in Elasticsearch and get the documents in each group?

Time:04-29

My Elasticsearch index contains products with a denormalized m:n relationship to categories.

My goal is to derive a categories index from it which contains the same information, but with the relationship inverted.

The index looks like this:

PUT /products
{
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "article_id": {
                "type": "keyword"
            },
            "categories": {
                "type": "nested",
                "properties": {
                    "cat_name": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

containing documents created like this:

POST /products/_doc
{
  "name": "radio",
  "article_id": "1001",
  "categories": [
    { "cat_name": "audio" },
    { "cat_name": "electronics" }
  ]
}

POST /products/_doc
{
  "name": "fridge",
  "article_id": "1002",
  "categories": [
    { "cat_name": "appliances" },
    { "cat_name": "electronics" }
  ]
}

I would like to get something like this back from Elasticsearch:

{
  "name": "appliances",
  "products": [
    { 
      "name": "fridge",
      "article_id": "1002"
    }
  ]
},
{
  "name": "audio",
  "products": [
    { 
      "name": "radio",
      "article_id": "1001"
    }
  ]
},
{
  "name": "electronics",
  "products": [
    { 
      "name": "fridge",
      "article_id": "1002"
    },
    { 
      "name": "radio",
      "article_id": "1001"
    }
  ]
}

which would eventually be put into an index such as:

PUT /categories
{
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "products": {
                "type": "nested",
                "properties": {
                    "name": {
                        "type": "keyword"
                    },
                    "article_id": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

I cannot figure out how to do this without loading and grouping all products programmatically. Here's what I have tried:

  1. Bucket aggregation on field categories.cat_name

    This gives me the document count per category but not the product documents. Using top_hits sub-aggregation seems to be limited to 100 documents.

  2. Group using collapse field with expansion

    Collapsing is only possible on a single-valued field.

I'm using Elasticsearch 8.1.

CodePudding user response:

The query you need is this one:

POST products/_search
{
  "size": 0,
  "aggs": {
    "cats": {
      "nested": {
        "path": "categories"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "categories.cat_name",
            "size": 10
          },
          "aggs": {
            "root": {
              "reverse_nested": {},
              "aggs": {
                "products": {
                  "terms": {
                    "field": "name",
                    "size": 10
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Which produces exactly what you need (less the article id, but that's easy):

    "buckets" : [
      {
        "key" : "electronics",
        "doc_count" : 2,
        "root" : {
          "doc_count" : 2,
          "products" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "fridge",
                "doc_count" : 1
              },
              {
                "key" : "radio",
                "doc_count" : 1
              }
            ]
          }
        }
      },
      {
        "key" : "appliances",
        "doc_count" : 1,
        "root" : {
          "doc_count" : 1,
          "products" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "fridge",
                "doc_count" : 1
              }
            ]
          }
        }
      },
      {
        "key" : "audio",
        "doc_count" : 1,
        "root" : {
          "doc_count" : 1,
          "products" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "radio",
                "doc_count" : 1
              }
            ]
          }
        }
      }
    ]
  • Related