Calculate exact count of distinct values for combination of 2 fields in Elasticsearch


I have around 40 million records in my Elasticsearch index. I want to calculate the count of distinct values for the combination of 2 fields.

Example for given set of documents:

[
 {
  "JobId" : 2,
  "DesigId" : 12
 },
 {
  "JobId" : 2,
  "DesigId" : 4
 },
 {
  "JobId" : 3,
  "DesigId" : 5
 },
 {
  "JobId" : 2,
  "DesigId" : 4
 },
 {
  "JobId" : 3,
  "DesigId" : 5
 }
]

For the above example, I should get count = 3, as only 3 distinct combinations exist: [(2,12),(2,4),(3,5)]

I tried using the cardinality aggregation for this, but it only provides an approximate count. I want to calculate the exact count.

Below is the query I used with the cardinality aggregation:

"aggs": {
        "counts": {
            "cardinality": {
                "script": "doc['JobId'].value + ',' + doc['DesigId'].value",
                "precision_threshold": 40000
            }
        }
    }

I also tried a composite aggregation on the combination of the 2 fields, paginating with the after key and counting the total number of buckets, but that process is really time-consuming and my query times out.
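For reference, the composite aggregation I tried looked roughly like this (the source names and the size value here are illustrative):

{
    "size": 0,
    "aggs": {
        "combos": {
            "composite": {
                "size": 10000,
                "sources": [
                    { "job": { "terms": { "field": "JobId" } } },
                    { "desig": { "terms": { "field": "DesigId" } } }
                ]
            }
        }
    }
}

Each response returns an after_key, which I pass back as "after" in the next request until no buckets come back, summing the bucket counts along the way.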

Is there any optimal way to achieve it?

CodePudding user response:

Scripting should be avoided as it affects performance. For your use case, there are 3 ways to achieve the required result:

  1. Using a composite aggregation (which you have already tried)
  2. Using a multi_terms aggregation, though this is not a memory-efficient solution

Search Query:

{
    "size": 0,
    "aggs": {
        "jobId_and_DesigId": {
            "multi_terms": {
                "terms": [
                    {
                        "field": "JobId"
                    },
                    {
                        "field": "DesigId"
                    }
                ]
            }
        }
    }
}
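Note that, like terms, multi_terms returns only the top 10 buckets by default, so to enumerate every distinct pair you would need to raise size (the value below is illustrative, not a recommendation):

{
    "size": 0,
    "aggs": {
        "jobId_and_DesigId": {
            "multi_terms": {
                "size": 65536,
                "terms": [
                    { "field": "JobId" },
                    { "field": "DesigId" }
                ]
            }
        }
    }
}

The number of distinct combinations is then the length of the buckets array, which is also why this approach does not scale well in memory.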

Search Result:

"aggregations": {
        "jobId_and_DesigId": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": [
                        2,
                        4
                    ],
                    "key_as_string": "2|4",
                    "doc_count": 2
                },
                {
                    "key": [
                        3,
                        5
                    ],
                    "key_as_string": "3|5",
                    "doc_count": 2
                },
                {
                    "key": [
                        2,
                        12
                    ],
                    "key_as_string": "2|12",
                    "doc_count": 1
                }
            ]
        }
    }

  3. The combined field value (i.e., the combination of "JobId" and "DesigId") should be stored at index time itself, as this is the best method. This is possible by using a set processor:
PUT /_ingest/pipeline/concat
{
  "processors": [
    {
      "set": {
        "field": "combined_field",
        "value": "{{JobId}} {{DesigId}}"
      }
    }
  ]
}
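You can verify the processor with the simulate API before wiring it in (the sample document values are taken from the question):

POST /_ingest/pipeline/concat/_simulate
{
  "docs": [
    { "_source": { "JobId": 2, "DesigId": 12 } }
  ]
}

The response should show combined_field set to "2 12".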

Index API

When indexing documents, you need to add the pipeline=concat query parameter each time you index a document. An index API request will look like:

POST <your-index>/_doc/1?pipeline=concat
{
    "JobId": 2,
    "DesigId": 12
    
}
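For documents that are already indexed, the pipeline can be applied retroactively with the update-by-query API (the index name is a placeholder):

POST <your-index>/_update_by_query?pipeline=concat

This reprocesses every existing document through the concat pipeline so that combined_field is populated for old data as well, though on 40 million documents it is a heavy one-off operation.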

Search Query:

{
    "size": 0,
    "aggs": {
        "jobId_and_DesigId": {
            "terms": {
                "field":"combined_field.keyword"
            }
        }
    }
}

Search Result:

 "aggregations": {
        "jobId_and_DesigId": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "2 4",
                    "doc_count": 2
                },
                {
                    "key": "3 5",
                    "doc_count": 2
                },
                {
                    "key": "2 12",
                    "doc_count": 1
                }
            ]
        }
    }

CodePudding user response:

The cardinality aggregation only gives an approximate count. Since the number of distinct combinations can exceed 40,000 (the maximum supported precision_threshold), raising the precision threshold will not help either.

You can use a scripted metric aggregation instead. It gives an accurate count but is considerably slower than the cardinality aggregation.

{
  "aggs": {
    "Distinct_Count": {
      "scripted_metric": {
        "init_script": "state.list = []",
        "map_script": """
                            state.list.add(doc['JobId'].value + '-' + doc['DesigId'].value);
                      """,
        "combine_script": "return state.list;",
        "reduce_script":"""
                          Map uniqueValueMap = new HashMap(); 
                          int count = 0;
                          for(shardList in states) {
                            if(shardList != null) { 
                              for(key in shardList) {
                                if(!uniqueValueMap.containsKey(key)) {
                                  count += 1;
                                  uniqueValueMap.put(key, key);
                                }
                              }
                            }
                          } 
                          return count;
                        """
      }
    }
  }
}
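The reduce step above is essentially a cross-shard de-duplication. A minimal Python sketch of the same logic (the shard lists and "JobId-DesigId" keys below are made up for illustration):

```python
def reduce_distinct_count(states):
    """Mimic the reduce_script: count each key only once across all shard lists."""
    seen = set()
    for shard_list in states:
        if shard_list is None:  # a shard may contribute no state
            continue
        for key in shard_list:
            seen.add(key)
    return len(seen)


# Per-shard lists of "JobId-DesigId" keys, as built by the map_script
shards = [["2-12", "2-4"], ["2-4", "3-5"], None, ["3-5"]]
print(reduce_distinct_count(shards))  # 3
```

Because every shard ships its full key list to the coordinating node, memory use grows with the number of distinct combinations, which is the trade-off for exactness.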