How can I aggregate only on the hour component in Elasticsearch?-CodePudding

I have a variety of behavior data in a big Elasticsearch database, and I'd like to do some analysis. In particular, I want to look at repeat behaviors by the time of day, but not different days.

This is probably clearer with an example. Imagine that I have purchase data for a supermarket in an Elasticsearch index. That will have a date-time field that records the day, month, year, hour, minute, etc. someone made a purchase.

I want to figure out whether or not this given person has a habit or not. For instance, whether they always come to the store and buy something at 8:00 (the hour component of their purchase time is consistently the value 8) or they're not following a pattern (the hour component is highly variable). Maybe they come in on different days; I don't really care. I just want to know, if they come to the store, do they make that trip at a predictable time?

Obviously I could figure this out by having Elasticsearch pull all the data, and then do the aggregation myself in post-processing code. Parse the dates, extract only the hour component, do my analysis.

Is there any way to do this "natively" in Elasticsearch with an aggregation query? I was unable to find anything obvious in the documentation.

CodePudding user response：

You can also use a script during aggregation. See the example below:

POST idx_buy/_doc
{
  "user":"A",
  "buy_date":"2022-05-13T12:00:10"
}

POST idx_buy/_doc
{
  "user":"B",
  "buy_date":"2022-05-13T13:00:10"
}

POST idx_buy/_doc
{
  "user":"A",
  "buy_date":"2022-05-13T12:10:10"
}

GET idx_buy/_search
{
  "size": 0,
  "aggs": {
    "user": {
      "terms": {
        "field": "user.keyword",
        "size": 10
      },
      "aggs": {
        "buy-by-hours": {
          "terms": {
            "script": {
              "source": "return doc['buy_date'].value.getHour();"
            },
            "size": 10
          }
        }
      }
    }
  }
}

Response:

  "aggregations" : {
    "user" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 2,
          "buy-by-hours" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "12",
                "doc_count" : 2
              }
            ]
          }
        },
        {
          "key" : "B",
          "doc_count" : 1,
          "buy-by-hours" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "13",
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }

CodePudding user response：

I don't know if I understand exactly your requirements. Here is my solution shortly. You can use the following query to find the most purchased hours for a user :

GET transactions/_search
{
  "query": {
    "term": {
      "user_id": {
        "value": 1
      }
    }
  }, 
  "size": 0, 
  "runtime_mappings": {
    "hours": {
      "type": "keyword",
      "script": {
        "source": "DateTimeFormatter dtf = DateTimeFormatter.ofPattern(\"HH\"); emit(doc['transaction_date'].value.format(dtf));"
      }
    }
  },
  "aggs": {
    "hours": {
      "terms": {
        "field": "hours"
      }
    }
  }
}

Extracted hours from the date and aggregated by hours for a specific user. So, the day or other parts of the DateTime is meaningless. And here is the result :

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ter" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "15",
          "doc_count" : 3
        },
        {
          "key" : "13",
          "doc_count" : 2
        },
        {
          "key" : "11",
          "doc_count" : 1
        },
        {
          "key" : "12",
          "doc_count" : 1
        }
      ]
    }
  }
}

According to the result, user 1 comes to the store and buys something at 3 pm frequently. There are some other hours, but the most popular hour is 3 pm for this user. So, you can find the most popular hours per user with sub aggregation with users.

GET transactions/_search
{
  "size": 0, 
  "runtime_mappings": {
    "hours": {
      "type": "keyword",
      "script": {
        "source": "DateTimeFormatter dtf = DateTimeFormatter.ofPattern(\"HH\"); emit(doc['transaction_date'].value.format(dtf));"
      }
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field": "user_id",
        "size": 10
      },
      "aggs": {
        "hours": {
          "terms": {
            "field": "hours",
            "size": 1
          }
        }
      }
    }
  }
}

The previous request will return the most purchased hours per user.

Here is my sample data: https://gist.github.com/hkulekci/120ca98816e50c9edf80a935d60477d4