I've been reading a lot about this topic, since I've seen it has been asked before, but I can't get it to work yet.
I am trying to get unique values from an index.
I have something like this:
id | app_name | url
1 | app_1 | https://subdomain.app_1.com
2 | app_1 | https://app_1.com
3 | app_2 | https://app_1.com
4 | app_3 | https://subdomain.app_3.com
5 | app_1 | https://app_3.com
I would like to receive just the distinct app_name values:
app_1
app_2
app_3
The query I tried with aggs is:
GET app_index/_search
{
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name"
      }
    }
  }
}
I also tried a kind of group by here:
GET app_index/_search
{
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name.keyword"
      },
      "aggs": {
        "oneRecord": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
But I still receive all the apps.
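For context on the queries above: a terms aggregation returns its distinct values under aggregations.unique_apps.buckets, while hits still lists matching documents (ten by default) unless the request sets "size": 0. Here is a minimal Python sketch of that idea, assuming app_name.keyword is a keyword sub-field in the mapping. The response dict is a hand-written stand-in built from the sample rows, not real output, and build_request and distinct_values are hypothetical helper names.

```python
import json


def build_request(field: str, max_buckets: int = 100) -> dict:
    """Body for GET app_index/_search that asks for distinct values only."""
    return {
        "size": 0,  # return no document hits; aggregation results still come back
        "aggs": {
            "unique_apps": {
                # "size" here caps the number of buckets (the default is 10)
                "terms": {"field": field, "size": max_buckets}
            }
        },
    }


def distinct_values(response: dict) -> list:
    """Collect the bucket keys, one per distinct value of the field."""
    buckets = response["aggregations"]["unique_apps"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hand-written stand-in for a response, derived from the sample rows above.
sample_response = {
    "hits": {"hits": []},
    "aggregations": {
        "unique_apps": {
            "buckets": [
                {"key": "app_1", "doc_count": 3},
                {"key": "app_2", "doc_count": 1},
                {"key": "app_3", "doc_count": 1},
            ]
        }
    },
}

print(json.dumps(build_request("app_name.keyword"), indent=2))
print(distinct_values(sample_response))  # ['app_1', 'app_2', 'app_3']
```

The documents themselves are still stored five times; the aggregation only reports each app_name once.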
- Is there a way to receive unique values?
- Is it possible to check in logstash whether a value already exists in the database and avoid sending it again? Or to use the fingerprint plugin to generate a unique_id from the value of the field? If the same value arrives in that field again, it would produce the same ID, so it would not be saved again.
- I also checked if there's any possibility to create unique fields in Elasticsearch but I see it's not possible.
I also added the question in the elastic discuss forum: https://discuss.elastic.co/t/distinct-values-dsl-query/302715
Thank you very much for your help and time
CodePudding user response:
- Is there a way to receive unique values?
I used the fingerprint plugin in this case. It generates a unique ID based on the string: if I receive the same app_name, it always generates the same _id, so the document won't be repeated in Elasticsearch. I added this config to the logstash.conf pipeline, in the filter section:
fingerprint {
  source => ["app_name"]
  target => "unique_id_by_app_name"
  method => "SHA1"
}
Then in the output section:
elasticsearch {
  hosts => "localhost:9200"
  index => "logstash_apps"
  document_id => "%{[unique_id_by_app_name]}"
}
If I receive app_1 again, with the same or even different data, it gets the same ID because the hash depends only on the app_name value:
$ -> echo -n "app_1" | sha1sum | awk -F ' -' '{print $1}'
87dbad46d7c47f3714eb02ff70e18b94e4ee6523
This also answers the second question.
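To illustrate why the repeated rows collapse, here is a minimal Python sketch that stands in for the pipeline. It assumes that the fingerprint filter, configured with SHA1 and no key, yields the plain SHA-1 hex digest of the app_name value, and it models the index as a dict keyed by _id. The names fingerprint and index_events are hypothetical helpers, not part of Logstash or Elasticsearch.

```python
import hashlib


def fingerprint(app_name: str) -> str:
    """Assumed equivalent of the filter: plain SHA-1 hex digest of the value."""
    return hashlib.sha1(app_name.encode("utf-8")).hexdigest()


def index_events(events):
    """Model an index as a dict keyed by _id: a repeated _id overwrites the document."""
    index = {}
    for event in events:
        index[fingerprint(event["app_name"])] = event
    return index


# Sample rows from the question.
events = [
    {"id": 1, "app_name": "app_1", "url": "https://subdomain.app_1.com"},
    {"id": 2, "app_name": "app_1", "url": "https://app_1.com"},
    {"id": 3, "app_name": "app_2", "url": "https://app_1.com"},
    {"id": 4, "app_name": "app_3", "url": "https://subdomain.app_3.com"},
    {"id": 5, "app_name": "app_1", "url": "https://app_3.com"},
]

index = index_events(events)
print(len(index))  # 3
print(sorted(doc["app_name"] for doc in index.values()))  # ['app_1', 'app_2', 'app_3']
```

Keep in mind that the last event written for each app_name wins, so only one url per app survives. That is fine when only the distinct app names matter, but it discards the other rows.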
- I also checked if there's any possibility to create unique fields in Elasticsearch but I see it's not possible
Definitely not. The only field that is guaranteed to be unique is _id.