How do I compute and add meta data to an existing Elasticsearch index?


I loaded over 38 million documents (text strings) into an Elasticsearch index on my local machine. I would like to compute the length of each string and add that value as metadata in the index.

Should I have computed the string lengths as metadata before loading the documents into Elasticsearch? Or can I add the computed values as metadata after the fact?

I'm relatively new to Elasticsearch/Kibana and these questions arose because of the following Python experiments:

  1. Data as a list of strings

     mylist = ['string_1', 'string_2',..., 'string_N']
     L = [len(s) for s in mylist]  # this computation takes about 1 minute on my machine
    

    The downside of option 1 is that I'm not leveraging Elasticsearch and 'mylist' is occupying a large chunk of memory.

  2. Data as an Elasticsearch index where each string in 'mylist' was loaded into the field 'text'.

     from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
     document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='myindex')
     docs = document_store.get_all_documents_generator()
     L = [len(d.text) for d in docs]  # this computation takes about 6 minutes on my machine
    

    The downside of option 2 is that the computation took much longer. The upside is that the generator keeps memory usage low. The long computation time is why I thought storing the string length (and other analytics) as metadata in Elasticsearch would be a good solution.

Are there other options I should consider? What am I missing?

CodePudding user response:

If you want to store the size of the whole document, I suggest installing the mapper-size plugin, which will store the size of the source document in the _size field.
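
For example, once the plugin is installed on the node, the _size field has to be enabled per index in its mapping. A minimal Python sketch, assuming the official elasticsearch client (7.x-style body argument) and the index name myindex from the question; creating the index this way is purely illustrative:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The mapper-size plugin must already be installed on the node;
# _size is then enabled in the index mapping.
es.indices.create(
    index="myindex",
    body={"mappings": {"_size": {"enabled": True}}},
)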

If you only want to store the size of a specific field of your source document, then you need to proceed differently.

What I suggest is to create an ingest pipeline that will process each document just before it gets indexed. That ingest pipeline can then be used either when indexing the documents the first time or after having loaded the documents. I'll show you how.

First, create the ingest pipeline with a script processor that stores the length of the string in the text field in another field called textLength.

PUT _ingest/pipeline/string-length
{
  "description": "My optional pipeline description",
  "processors": [
    {
      "script": {
        "source": "ctx.textLength = ctx.text.length()"
      }
    }
  ]
}
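
Since you are already working from Python, the same pipeline can also be created through the official elasticsearch client. A minimal sketch, assuming a 7.x-style client where the request body is passed as body:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Same pipeline as above, created via the Python client.
es.ingest.put_pipeline(
    id="string-length",
    body={
        "description": "Store the length of the 'text' field in 'textLength'",
        "processors": [
            {"script": {"source": "ctx.textLength = ctx.text.length()"}}
        ],
    },
)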

So, if you've already loaded the documents into Elasticsearch and would like to enrich each document with the length of one of its fields, you can do it after the fact using the Update by Query API, like this:

POST myindex/_update_by_query?pipeline=string-length&wait_for_completion=false
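
From Python, the equivalent call would look roughly like this (a sketch assuming the same 7.x-style client; pipeline and wait_for_completion are passed as query parameters of Update by Query):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run every existing document in 'myindex' through the pipeline.
# With wait_for_completion=False the call returns a task id that
# can be polled via the Tasks API instead of blocking.
task = es.update_by_query(
    index="myindex",
    pipeline="string-length",
    wait_for_completion=False,
)
print(task)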

It is also possible to leverage that ingest pipeline at indexing time, when the documents are indexed for the first time, simply by referencing the pipeline in your indexing request, like this:

PUT myindex/_doc/123?pipeline=string-length
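
The same indexing request from Python would look roughly like this (the document id 123 and the document body are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a single document through the pipeline; 'textLength' is added
# by the pipeline before the document is stored.
es.index(
    index="myindex",
    id=123,
    body={"text": "some example string"},
    pipeline="string-length",
)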

Both options will work; try them out and pick the one that best suits your needs.
