Elasticsearch attachment plugin performance improvement

Time:04-14

I am new to Elasticsearch and am trying to parse PDF files through an ingest pipeline using the Elasticsearch attachment plugin. Parsing seems to take a long time, and the time grows with the PDF size: 1 MB takes about 2 s, 5 MB about 15 s, 10 MB about 25 s, and so on. Please advise: how can I improve this execution time?

PUT _ingest/pipeline/attachment
{
 "description" : "Extract attachment information",
 "processors" : [
 {
  "attachment" : {
    "field" : "data"
  }
 }
]
}

PUT my-index-000001/_doc/my_id?pipeline=attachment
{
 "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
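For context, the "data" value above is the base64 encoding of the raw file bytes (here a tiny RTF document). A minimal sketch of how a client might build that request body in Python; the helper name build_doc is hypothetical and not part of any Elasticsearch client:

```python
import base64
import json

# Raw bytes of the small RTF sample that the request above encodes.
RAW_RTF = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"


def build_doc(raw: bytes) -> str:
    """Return the JSON body for the index request (hypothetical helper)."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"data": encoded})


body = build_doc(RAW_RTF)
print(body)
```

Base64 grows the payload by roughly one third, so a 10 MB PDF is sent to the ingest node as about 13.3 MB of text before any parsing starts.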

Thanks

CodePudding user response:

It's an expensive operation and will cost resources. I would explore using FSCrawler (https://fscrawler.readthedocs.io/en/fscrawler-2.9/) or another Tika-based library to offload the whole operation from Elasticsearch. You might be able to do a lot of the work in parallel, or process the data before it is ready to index.
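As a rough illustration of the offloading idea, here is a minimal Python sketch that extracts text on the client in parallel and then builds a body for the Elasticsearch _bulk API, so no attachment pipeline runs inside the cluster. The function names (extract_all, to_bulk_body), the field names "path" and "content", and the stand-in extractor are all assumptions made for the example; in practice the extractor would wrap whatever Tika-based tool you choose.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def extract_all(paths: List[str],
                extract: Callable[[str], str],
                workers: int = 4) -> List[str]:
    """Run the extractor over all files in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, paths))


def to_bulk_body(index: str, paths: List[str], texts: List[str]) -> str:
    """Build newline-delimited JSON for the _bulk API: action line, then source line."""
    lines = []
    for path, text in zip(paths, texts):
        lines.append(json.dumps({"index": {"_index": index}}))
        # "path" and "content" are illustrative field names, not required by Elasticsearch.
        lines.append(json.dumps({"path": path, "content": text}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def fake_extract(path: str) -> str:
    """Stand-in extractor (assumption); replace with a real Tika-based call."""
    return f"text of {path}"


paths = ["a.pdf", "b.pdf"]
texts = extract_all(paths, fake_extract)
bulk_body = to_bulk_body("my-index-000001", paths, texts)
print(bulk_body)
```

A thread pool suits extractors that mostly wait on an external process or server; if the extraction itself is CPU-bound Python code, a ProcessPoolExecutor may be the better fit. The resulting body can be sent to POST _bulk without the ?pipeline=attachment parameter, since the text is already extracted.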
