How can I extract Text in real-time for Azure Cognitive Services?

Time:10-21

I'm looking to build an app which performs various text-based Cognitive Services functions against documents.

However, I seem to be failing at the first hurdle, which is getting the text out of the documents in the first place.

I'm aware that OCR and Form Recognizer both perform variations on this ("Text Recognition" and "Text Extraction" respectively) - but for standard documents (e.g. Word / Excel / PDF) this feels like massive overkill.

Cognitive Search includes the "document cracking" process - but I need to process the documents in real time, so I don't want to have to deal with indexes in Azure.

Is there a simpler "get me the text" functionality in Azure (either in Cognitive Services or otherwise) I can use for this?

What I don't really want to have to do is write my own functions for each different file type (e.g. PDF / DOCX / TXT / PNG / MSG) and work out which API I need to use for each one.

Thank you in advance!

CodePudding user response:

AFAIK, there's no ready-to-use tool besides the Document Extraction cognitive skill (Azure Cognitive Search):

https://docs.microsoft.com/en-us/azure/search/cognitive-search-skill-document-extraction

You can also build your own pipeline to extract the text using Tika.NET:

https://github.com/KevM/tikaondotnet
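
To give a rough idea of what such a pipeline can look like, here is a minimal sketch in Python rather than .NET. It does not use Tika.NET; instead it assumes you run Apache Tika Server yourself (by default it listens on http://localhost:9998) and sends each file to its `/tika` endpoint, which returns plain text when asked with `Accept: text/plain`. The names `TIKA_URL`, `build_tika_request` and `extract_text` are made up for this example, and the URL and timeout are assumptions you would adjust.

```python
import urllib.request

# Assumption: a Tika Server instance you started yourself, on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the HTTP request that asks Tika Server for plain text.

    Tika detects the file type from the bytes, so the same request shape
    is used for PDF, DOCX, TXT and so on.
    """
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Read a file from disk, send it to Tika Server, return the extracted text."""
    with open(path, "rb") as f:
        data = f.read()
    request = build_tika_request(data, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, something like `text = extract_text("report.pdf")` (the file name is only a placeholder) would give you a string you can then pass on to the Cognitive Services text APIs. The point of this design is that there is one code path for every file type, so you don't have to write a separate function per format. Whether images such as PNG yield any text depends on how your Tika installation is set up for OCR, so treat that part as something to verify yourself.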
