How do I manage an index with more than 10,000,000 documents in Elasticsearch for fast searching?


I have created an index in my Elasticsearch server (localhost:9200) that holds the log files of an application. The data is more than 25 GB (JSON), and it took almost 3 hours to send it from Logstash to Elasticsearch.

According to the http://localhost:9200/_cat/indices?v request, I can see that my index has more than 22 million documents.

    health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   test  i4nRcsZ8SJyjuKXpH2lI_A   1   1   22240500            0     11.5gb         11.5gb

When I search for a particular field and its value, Elasticsearch takes a very long time to return results from my index. I know that my output will have more than 10,000 documents, which is why I use the scan helper instead of search in Python.

My sole reason for choosing Elasticsearch was that it returns results very quickly, but in my case it takes several minutes, and in most of my tests I get a timeout error from Elasticsearch.

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    # connect to the local cluster
    es = Elasticsearch("http://localhost:9200")

    query = {
        "_source": ["full_log"],
        "query": {
            "match": {
                "location": "192.168.0.1"
            }
        }
    }

    # scan() wraps the scroll API and lazily yields every matching hit
    rel = scan(
        client=es,
        query=query,
        scroll="1m",
        index="test",
        raise_on_error=True,
        preserve_order=False,
        clear_scroll=True,
    )

How can I improve my search result time?

Is this how search engines like Netflix's also retrieve data?

CodePudding user response:

The answer to your question has two "levels".

The first level is the literal one: to make your query faster, make sure location is mapped as a keyword field, and query it with a term query instead of match. Also, look through the Tune for search speed documentation.
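
A minimal sketch of that change, assuming the default dynamic mapping created a location.keyword sub-field (adjust the field name to whatever your mapping actually uses):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    query = {
        "_source": ["full_log"],
        "size": 100,
        "query": {
            # term does an exact, un-analyzed lookup against the keyword field
            "term": {
                "location.keyword": "192.168.0.1"
            }
        }
    }

    resp = es.search(index="test", body=query)
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["full_log"])

If location is already mapped as keyword (rather than text), you can use the field name directly in the term query.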

The second level urges you to look at the big picture. If you're loading millions of documents into memory anyway, maybe it would be faster to load them from the original JSON and keep them in memory? Or load them from JSON on demand? Or create a few JSON buckets, one per location, and quickly read the right one when needed?

Or maybe you don't really need to load all the docs at once and can process the results in batches, as sketched below? Loading all the data into memory won't scale, and you can run out of memory if your data volume grows.
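
For example, here is a minimal sketch of batch processing with the scan generator from elasticsearch-py; process_batch and the batch size of 5,000 are hypothetical placeholders for your own logic:

    from itertools import islice

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    es = Elasticsearch("http://localhost:9200")

    query = {
        "_source": ["full_log"],
        "query": {"term": {"location.keyword": "192.168.0.1"}},
    }

    # scan() is a generator, so hits are pulled from Elasticsearch lazily
    # instead of being held in memory all at once
    hits = scan(client=es, query=query, index="test", scroll="1m")

    def process_batch(batch):
        # placeholder: write to disk, aggregate, etc.
        print(f"processing {len(batch)} docs")

    while True:
        batch = list(islice(hits, 5000))  # take the next 5,000 hits
        if not batch:
            break
        process_batch(batch)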

Elasticsearch is great for full-text search, language processing and aggregations, but if you use it as simple bulk storage, the overhead becomes significant.
