I am new to Elasticsearch and am trying to parse PDF files through an ingest pipeline using the Elasticsearch attachment plugin. Parsing takes a long time, and the time grows with the PDF size: roughly 2 sec for 1 MB, 15 sec for 5 MB, 25 sec for 10 MB, and so on. Please advise: how can I improve this execution time?
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my-index-000001/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
Thanks
CodePudding user response:
It's an expensive operation and will cost resources. I would explore using FSCrawler (https://fscrawler.readthedocs.io/en/fscrawler-2.9/) or another Tika-based library to offload the whole operation from Elasticsearch. You might be able to do a lot of the work in parallel, or process the data before it is ready to index.
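As a rough illustration of the offloading idea, here is a minimal Python sketch that extracts text outside the cluster in parallel and builds a newline-delimited body for the Elasticsearch `_bulk` API. The names `extract_text` and `build_bulk_body`, and the `filename` and `content` fields, are hypothetical and not part of any library. The extractor is only a placeholder: it does not parse PDFs, and you would replace it with a call to Tika or whichever extraction tool you choose. Threads are assumed to help because real extraction usually waits on an external Tika process or server.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(path):
    """Placeholder extractor (assumption): swap in a real Tika call here.

    This only decodes raw bytes; it does NOT parse PDF content.
    """
    return path.read_bytes().decode("utf-8", errors="ignore")


def build_bulk_body(paths, index, extractor=extract_text, workers=4):
    """Extract text from all files concurrently, then build a _bulk body."""
    paths = list(paths)
    # Run the extractor on several files at once, outside Elasticsearch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(extractor, paths))

    lines = []
    for path, text in zip(paths, texts):
        # Action line, followed by the document source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": path.name}}))
        lines.append(json.dumps({"filename": path.name, "content": text}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The returned string can be sent to `POST _bulk` with the `Content-Type: application/x-ndjson` header. Because the text is already extracted, no `pipeline=attachment` parameter is needed, so the Elasticsearch nodes only do the indexing.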