Elasticsearch filter to obtain the last document for each value of given field-CodePudding

Consider the following set of documents :

{transaction: "A", date: "1999-10-27" }
{transaction: "B", date: "1999-10-26" }
{transaction: "C", date: "1999-10-26" }
{transaction: "A", date: "1999-10-27" }
{transaction: "A", date: "1999-12-25" }
{transaction: "B", date: "2000-10-25" }

I'm trying to add a filter that would only select the last document for each transaction, based of the value of the field date and obtain the following documents :

{transaction: "C", date: "1999-10-26" }
{transaction: "A", date: "1999-12-25" }
{transaction: "B", date: "2000-10-25" }

Also, a term aggregation wouldn't work because I also need to make a histogram aggregation (by year) on the resulting documents

{
    1999: 2
}, {
    2000: 1
}

CodePudding user response：

There are two ways to achieve what you need.

A. Using the collapse feature

GET test/_search
{
  "_source": false,
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "transaction",
    "inner_hits": [
      {
        "name": "latest",      
        "size": 1,
        "sort": [
          {
            "date": {
              "order": "desc"
            }
          }
        ]
      }
    ]
  }
}

B. Using a terms aggregation on the transaction field a top_hits sorted on date for the latest transaction

GET test/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "transactions": {
      "terms": { 
        "field": "transaction" 
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                 "date": { 
                    "order": "desc"
                 }
              }
            ]                   
          }
        }
      }
    }
  }
}

Using the above query, it is then trivial to figure out a date histogram in your application logic.

UPDATE:

If you really want ES to build that date histogram for you, you can achieve this by leveraging the scripted_metric aggregation and building the aggregation logic yourself. Note that this solution uses scripting and it might impair the performance of your cluster depending on the volume of your data.

POST test/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "years": {
      "scripted_metric": {
        "init_script": "state.latest = [:]",
        "map_script": """
          // 1. record the latest year for each transaction
          def key = doc['transaction.keyword'].value;
          if (!state.latest.containsKey(key)) {
            state.latest[key] = 0;
          }
          def year = doc['date'].value.getYear();
          if (state.latest[key] < year) {
            state.latest[key] = year;
          }
        """,
        "combine_script": """
          return state.latest
        """,
        "reduce_script": """
          // 2. count how many documents per "latest" year
          def years = [:];
          states.stream().forEach(shardState -> {
            shardState.keySet().stream().forEach(transaction -> {
              def year = shardState[transaction].toString();
              if (!years.containsKey(year)) {
                years[year] = 0;
              }
              years[year]  ;
            });
          });
          return years;
        """
      }
    }
  }
}

For instance, the above aggregation query would work and return the following, which is pretty much what you expect:

  "aggregations" : {
    "years" : {
      "value" : {
        "2000" : 1,
        "1999" : 2
      }
    }
  }