Sequential Searching Across Multiple Indexes In Elasticsearch

Time:04-19

Suppose I have Elasticsearch indexes in the following order:

index-2022-04
index-2022-05
index-2022-06
...

index-2022-04 represents the data stored in the month of April 2022, index-2022-05 represents the data stored in the month of May 2022, and so on. Now let's say in my query payload, I have the following timestamp range:

"range": {
    "timestampRange": {
        "gte": "2022-04-05T01:00:00.708363",  
        "lte": "2022-06-06T23:00:00.373772"                 
    }
}

The above range states that I want to query the data between the 5th of April and the 6th of June. That means I have to query three indexes: index-2022-04, index-2022-05 and index-2022-06. Is there a simple and efficient way to run this query across those three indexes without querying each index one by one?

I am using Python to handle the query, and I am aware that I can query across different indexes at the same time (see this SO post). Any tips or pointers would be helpful, thanks.

CodePudding user response:

You simply need to define an alias over your indices and query the alias instead of the individual indexes; ES then figures out which underlying indexes it needs to visit.
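
As a minimal sketch of what defining such an alias could look like from Python: the helper below only builds the request body for the _aliases endpoint, one "add" action per index. The alias name index-all and the helper name build_alias_actions are made up for illustration and are not from the question.

```python
def build_alias_actions(index_names, alias_name):
    """Build a body for Elasticsearch's _aliases endpoint.

    Each entry asks Elasticsearch to add one index to the alias.
    """
    return {
        "actions": [
            {"add": {"index": name, "alias": alias_name}}
            for name in index_names
        ]
    }


# "index-all" is a hypothetical alias name chosen for this example.
body = build_alias_actions(
    ["index-2022-04", "index-2022-05", "index-2022-06"],
    "index-all",
)
print(body)
```

With a connected client, this body could then be sent through the client's indices.update_aliases call (how the actions are passed differs between client versions, so check the one you have installed), after which the search would simply target index-all.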

Additionally, for increased search performance, you can configure index-time sorting on timestampRange, so that if your alias spans a full year of indexes, ES knows to visit only three of them based on the range constraint in your query (2022-04-05 -> 2022-06-06).
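
For illustration, index-time sorting is configured through the index.sort.field and index.sort.order settings when an index is created; it cannot be added to an existing index afterwards. The sketch below only builds such a creation body. It assumes timestampRange is mapped as a date field, which the question does not state, and build_sorted_index_body is a hypothetical helper name.

```python
def build_sorted_index_body(sort_field="timestampRange", order="desc"):
    """Build an index-creation body with index-time sorting enabled.

    Assumption: the sort field is mapped as a `date`; adjust the mapping
    if the real field uses a different type.
    """
    return {
        "settings": {
            "index": {
                "sort.field": sort_field,
                "sort.order": order,
            }
        },
        "mappings": {
            "properties": {
                sort_field: {"type": "date"},
            }
        },
    }


body = build_sorted_index_body()
print(body)
```

Since the setting only applies at creation time, it would typically go into whatever creates each new monthly index (for example an index template), rather than being applied to the existing ones.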

CodePudding user response:

As you wrote, you can simply use a wildcard and/or pass a list as the target index.

The simplest way would be to just query all of your indices with an asterisk wildcard (e.g. index-* or index-2022-*) as target. You do not need to define an alias for that; you can just use the wildcard in the target string, like so:

from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

result = es_client.search(
             index = 'index-*',  
             query = { "bool": {
                         "must": [{ 
                             "range": {  
                                 "timestampRange": {
                                      "gte": datestring_start,  
                                      "lte": datestring_end                 
                                 }
                             }
                         }]
                     }
                 })

This targets all indices that match the pattern, but I would expect Elasticsearch to optimize this, for example by skipping shards that cannot contain documents in the requested range.

Another option would be to first figure out on the Python side which sequence of indices you need to query and supply these as a list (e.g. ['index-2022-04', 'index-2022-05', 'index-2022-06']) as target. You could e.g. use the pandas date_range() function to easily generate such a list of indices, like so:

from elasticsearch import Elasticsearch
import pandas as pd

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

months_list = pd.date_range(pd.to_datetime(datestring_start).to_period('M').to_timestamp(), datestring_end, freq='MS').strftime("index-%Y-%m").tolist()

result = es_client.search(
             index = months_list,
             query = { "bool": {
                         "must": [{ 
                             "range": {  
                                 "timestampRange": {
                                      "gte": datestring_start,  
                                      "lte": datestring_end                 
                                 }
                             }
                         }]
                     }
                 })
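
If pulling in pandas just for this feels heavy, the same list can be built with the standard library alone. This is only a sketch under the assumption that the indices follow the index-YYYY-MM naming from the question; monthly_indices is a hypothetical helper name.

```python
from datetime import datetime


def monthly_indices(start_iso, end_iso, prefix="index-"):
    """Return one index name per calendar month touched by [start, end]."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    if end < start:
        raise ValueError("end must not be earlier than start")

    names = []
    year, month = start.year, start.month
    # Walk month by month until the month containing `end` is included.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


months_list = monthly_indices(
    "2022-04-05T01:00:00.708363",
    "2022-06-06T23:00:00.373772",
)
print(months_list)  # ['index-2022-04', 'index-2022-05', 'index-2022-06']
```

One thing to watch with an explicit list: if one of the generated names does not exist (say, a month with no data), the search may fail with an index-not-found error. Passing ignore_unavailable=True to the search call is worth considering in that case.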