I have an ES index, and I want to count the number of distinct CONTACT_ID values where the [Have Agreement] flag is Y and where it is N. The flag is unique for each contact. However, when I add the number of contacts with the Y flag to the number with the N flag, the sum differs from the total number of distinct contacts.
1. Total distinct CONTACT_ID count:
POST /dashboard/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "CREATED": {
              "gte": "2021-07-04T00:00:00.001Z",
              "lte": "2021-12-31T00:00:00.001Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "UniqueContact": {
      "cardinality": {
        "field": "CONTACT_ID.keyword"
      }
    }
  }
}
The result is 27588.
2. Distinct CONTACT_ID count for the Y and N flags respectively:
POST /dashboard/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "CREATED": {
              "gte": "2021-07-04T00:00:00.001Z",
              "lte": "2021-12-31T00:00:00.001Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "CVID": {
      "terms": {
        "field": "Have Agreement.keyword",
        "order": {
          "type_count": "desc"
        }
      },
      "aggs": {
        "type_count": {
          "cardinality": {
            "field": "CONTACT_ID.keyword"
          }
        }
      }
    }
  }
}
The results are 2692 and 2158, which add up to 4850.
3. Evidence that the flag is unique for each contact:
POST /dashboard/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "CREATED": {
              "gte": "2021-07-04T00:00:00.001Z",
              "lte": "2021-12-31T00:00:00.001Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "CVID": {
      "terms": {
        "field": "CONTACT_ID.keyword",
        "order": {
          "type_count": "desc"
        }
      },
      "aggs": {
        "type_count": {
          "cardinality": {
            "field": "Have Agreement.keyword"
          }
        }
      }
    }
  }
}
CodePudding user response:
The results seem coherent, according to your example.
Keep in mind that a cardinality aggregation returns an approximation (you can raise its precision_threshold setting to gain some precision).
You have around 27588 distinct contacts (UniqueContact) matching your query (cardinality is accurate to within roughly 5%).
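As a sketch of the precision setting mentioned above: the cardinality aggregation accepts a precision_threshold option, and counts below that threshold are expected to be close to exact, at the cost of more memory. The value 40000 used here is the maximum Elasticsearch supports; whether it changes your numbers noticeably is something to check on your own data.

```
# Same request as query 1, with a higher precision_threshold on the cardinality aggregation
POST /dashboard/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "CREATED": {
              "gte": "2021-07-04T00:00:00.001Z",
              "lte": "2021-12-31T00:00:00.001Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "UniqueContact": {
      "cardinality": {
        "field": "CONTACT_ID.keyword",
        "precision_threshold": 40000
      }
    }
  }
}
```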
Top-level aggregation by Y or N (Have Agreement.keyword)
In the result we can read:
- 16725 documents with N
- 11190 documents with Y
- For the N group, you have around 2692 distinct contacts
- For the Y group, you have around 2158 distinct contacts
So several matching documents share the same contact ("duplicates"); we can see this in part 3 of your question:
- 10 docs with contact 3-QV3ZBW
- 10 docs with contact 3-QV3ZC3
=> So your second request is correct: you have around 2692 distinct contacts with the N value (2158 for Y).
The 2692 contacts appear in 16725 docs; the other 2158 appear in 11190 docs.
16725 + 11190 = 27915 => within 5% of 27588
PS: Add a term query on 3-QV3ZBW, for example; I think a simple case like this will answer your question.
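A minimal sketch of that check, assuming the same index, fields, and date range as in the question. The aggregation name "flags" is just an illustrative label. The total hit count shows how many documents exist for this one contact, and the buckets show which flag values those documents carry.

```
# Restrict the search to a single contact and list its flag values
POST /dashboard/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "CONTACT_ID.keyword": "3-QV3ZBW"
          }
        },
        {
          "range": {
            "CREATED": {
              "gte": "2021-07-04T00:00:00.001Z",
              "lte": "2021-12-31T00:00:00.001Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "flags": {
      "terms": {
        "field": "Have Agreement.keyword"
      }
    }
  }
}
```

If the response contains a single bucket whose doc_count equals the total hits, all documents for that contact share one flag value, and the contact is counted once in the per-flag cardinality however many documents it has.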