I have a variety of behavior data in a big Elasticsearch database, and I'd like to do some analysis. In particular, I want to look at repeat behaviors by the time of day, but not different days.
This is probably clearer with an example. Imagine that I have purchase data for a supermarket in an Elasticsearch index. That will have a date-time field that records the day, month, year, hour, minute, etc. someone made a purchase.
I want to figure out whether or not this given person has a habit or not. For instance, whether they always come to the store and buy something at 8:00 (the hour component of their purchase time is consistently the value 8) or they're not following a pattern (the hour component is highly variable). Maybe they come in on different days; I don't really care. I just want to know, if they come to the store, do they make that trip at a predictable time?
Obviously I could figure this out by having Elasticsearch pull all the data, and then do the aggregation myself in post-processing code. Parse the dates, extract only the hour component, do my analysis.
Is there any way to do this "natively" in Elasticsearch with an aggregation query? I was unable to find anything obvious in the documentation.
CodePudding user response:
You can also use a script during aggregation. See the example below:
POST idx_buy/_doc
{
"user":"A",
"buy_date":"2022-05-13T12:00:10"
}
POST idx_buy/_doc
{
"user":"B",
"buy_date":"2022-05-13T13:00:10"
}
POST idx_buy/_doc
{
"user":"A",
"buy_date":"2022-05-13T12:10:10"
}
GET idx_buy/_search
{
"size": 0,
"aggs": {
"user": {
"terms": {
"field": "user.keyword",
"size": 10
},
"aggs": {
"buy-by-hours": {
"terms": {
"script": {
"source": "return doc['buy_date'].value.getHour();"
},
"size": 10
}
}
}
}
}
}
Response:
"aggregations" : {
"user" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A",
"doc_count" : 2,
"buy-by-hours" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "12",
"doc_count" : 2
}
]
}
},
{
"key" : "B",
"doc_count" : 1,
"buy-by-hours" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "13",
"doc_count" : 1
}
]
}
}
]
}
}
CodePudding user response:
I don't know if I understand exactly your requirements. Here is my solution shortly. You can use the following query to find the most purchased hours for a user :
GET transactions/_search
{
"query": {
"term": {
"user_id": {
"value": 1
}
}
},
"size": 0,
"runtime_mappings": {
"hours": {
"type": "keyword",
"script": {
"source": "DateTimeFormatter dtf = DateTimeFormatter.ofPattern(\"HH\"); emit(doc['transaction_date'].value.format(dtf));"
}
}
},
"aggs": {
"hours": {
"terms": {
"field": "hours"
}
}
}
}
Extracted hours from the date and aggregated by hours for a specific user. So, the day or other parts of the DateTime is meaningless. And here is the result :
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"ter" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "15",
"doc_count" : 3
},
{
"key" : "13",
"doc_count" : 2
},
{
"key" : "11",
"doc_count" : 1
},
{
"key" : "12",
"doc_count" : 1
}
]
}
}
}
According to the result, user 1 comes to the store and buys something at 3 pm frequently. There are some other hours, but the most popular hour is 3 pm for this user. So, you can find the most popular hours per user with sub aggregation with users.
GET transactions/_search
{
"size": 0,
"runtime_mappings": {
"hours": {
"type": "keyword",
"script": {
"source": "DateTimeFormatter dtf = DateTimeFormatter.ofPattern(\"HH\"); emit(doc['transaction_date'].value.format(dtf));"
}
}
},
"aggs": {
"users": {
"terms": {
"field": "user_id",
"size": 10
},
"aggs": {
"hours": {
"terms": {
"field": "hours",
"size": 1
}
}
}
}
}
}
The previous request will return the most purchased hours per user.
Here is my sample data: https://gist.github.com/hkulekci/120ca98816e50c9edf80a935d60477d4