I have an aggregation that identifies duplicate records:
{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "size": 250,
        "min_doc_count": 2
      }
    }
  }
}
However, it misses many duplicates because size is too low: the actual cardinality of the field is over 2 million. If size is increased to the actual cardinality, or to some other much larger number, all of the duplicate documents are found, but the operation takes roughly 5x longer to complete.
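To make it concrete, the larger-size variant looks like the request below; the 2500000 value is only illustrative, any number above the actual cardinality behaves the same way:

{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "size": 2500000,
        "min_doc_count": 2
      }
    }
  }
}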
If I change size to a larger number, should I expect slow performance or other adverse effects on other operations while this aggregation is running?
CodePudding user response:
Yes, the size param is critical to Elasticsearch aggregation performance. If you set it to a very large number such as 10k (the limit set by Elasticsearch, which you can change via the search.max_buckets setting), it will have an adverse impact not only on the aggregation you are running but on all other operations running in the Elasticsearch cluster.
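If you do decide to raise the limit, here is a minimal sketch of what that looks like, assuming Elasticsearch 7.x where search.max_buckets is a dynamic cluster-level setting (the 20000 value is only an example):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000
  }
}

Raising this limit trades cluster-wide memory and latency for the completeness of a single aggregation, so it affects every search running on the cluster, not just yours.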
As you are using the terms aggregation, which is a bucket aggregation, you can read more about it in the Elasticsearch documentation on bucket aggregations.