DISTINCT values DSL query

Time:04-24

I've been reading a lot about this topic, since I've seen it asked before, but I can't get it to work yet.

I am trying to get unique values from an index.

I have something like this:

id | app_name       | url
1  | app_1          | https://subdomain.app_1.com
2  | app_1          | https://app_1.com
3  | app_2          | https://app_1.com
4  | app_3          | https://subdomain.app_3.com
5  | app_1          | https://app_3.com

I would like to receive just the distinct app_name:

app_1
app_2
app_3

The query I tried with aggs is:

GET app_index/_search
{
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name"
      }
    }
  }
}

I also tried a kind of group-by using top_hits:

GET app_index/_search
{
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name.keyword"
      },
      "aggs": {
        "oneRecord": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

But I still receive all the apps.

  • Is there a way to receive unique values?
  • Alternatively, is there a way to check in Logstash whether a value already exists in Elasticsearch and avoid sending it again? Or maybe use the fingerprint plugin to generate a unique _id from the value of the field? If I receive the same information in that field, it would generate the same ID, so the document wouldn't be saved again.
  • I also checked whether it's possible to create unique fields in Elasticsearch, but I see it's not.

I also added the question in the elastic discuss forum: https://discuss.elastic.co/t/distinct-values-dsl-query/302715

Thank you very much for your help and time

CodePudding user response:

  • Is there a way to receive unique values?

I've used the fingerprint plugin in this case to generate a unique ID based on the string. For example, if I receive the same app_name it will always generate the same _id, so the document won't be duplicated in Elasticsearch. I added this config to the logstash.conf pipeline, in the filter section:

fingerprint {
  source => ["app_name"]
  target => ["unique_id_by_app_name"]
  method => "SHA1"
}

Then in the output:

elasticsearch {
  hosts => "localhost:9200"
  index => "logstash_apps"
  document_id => "%{[unique_id_by_app_name]}"
}

If I receive app_1 again, with the same or even different data, I'll get the same ID because of the hashing:

$ echo -n "app_1" | sha1sum | awk -F '  -' '{print $1}'
87dbad46d7c47f3714eb02ff70e18b94e4ee6523
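To illustrate why this dedupes documents, here is a quick sanity check (a standalone sketch, not tied to any index) showing that SHA1 is deterministic, so the same app_name always maps to the same _id:

```shell
# SHA1 is deterministic: hashing the same app_name twice
# produces identical digests, so Elasticsearch always sees
# the same document_id for the same app_name value.
id1=$(printf '%s' "app_1" | sha1sum | awk '{print $1}')
id2=$(printf '%s' "app_1" | sha1sum | awk '{print $1}')
[ "$id1" = "$id2" ] && echo "same id: $id1"
```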

This also answers the second question.
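As a side note on the first question: the terms aggregation from the original post should also work on the existing index if search hits are suppressed with "size": 0, so the response contains only the buckets (using the .keyword sub-field, assuming the default dynamic mapping):

```
GET app_index/_search
{
  "size": 0,
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name.keyword"
      }
    }
  }
}
```

Each bucket key in the response is then one distinct app_name.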

  • I also checked if there's any possibility to create unique fields in Elasticsearch but I see it's not possible

Definitely not: the only field guaranteed to be unique is _id.
