Azure Cognitive Search find similar documents use case

Time:07-15

Please help me understand if what I am trying to do is possible to implement with Azure Cognitive Search.

I have a set of PDF files whose text has been extracted and indexed (I don't use the index's built-in OCR feature; I prepare the extracted PDF data with third-party tools). I need to implement a feature along the lines of "find similar documents in the index based on a new document".
As the input parameter for the search, I pass the extracted text of the new PDF (which usually looks messy, with many newline characters) and want to find extracted PDF files in my index that are similar to it, meaning they have a similar structure, company names, people, etc.
Is that possible? I can't find any similar cases described in the documentation, but I assume it could somehow be configured with a full query search.

Please advise: am I moving in the right direction at all?

CodePudding user response:

I think there are two possible approaches:

1. Implement an enrichment step during ingestion that pre-classifies the content.

2. Use the semantic search feature and rely on it to return documents that are similar and relevant to the content you're searching for.

EDIT

I just noticed there is a feature called 'moreLikeThis', currently in preview, which I believe is what you're looking for:

https://docs.microsoft.com/en-us/azure/search/search-more-like-this
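To make the 'moreLikeThis' idea concrete, here is a minimal sketch that only builds the query URL and does not send it. It assumes the preview API version `2020-06-30-Preview` and that the new document has already been added to the index, since the parameter takes the key of an indexed document rather than raw text. The service name, index name, document key, and field name are hypothetical placeholders.

```python
from urllib.parse import urlencode


def build_more_like_this_url(service, index, doc_key, search_fields,
                             api_version="2020-06-30-Preview"):
    """Build a GET URL asking for documents similar to an indexed document.

    doc_key is the key of a document that already exists in the index;
    search_fields lists the searchable fields to compare on.
    """
    params = {
        "moreLikeThis": doc_key,
        "searchFields": ",".join(search_fields),
        "api-version": api_version,
    }
    return (f"https://{service}.search.windows.net"
            f"/indexes/{index}/docs?{urlencode(params)}")


# Hypothetical names: replace with your own service, index, key, and field.
url = build_more_like_this_url("my-service", "pdf-index", "doc-42", ["content"])
print(url)
```

The request would also need an `api-key` header with a query key for the service. Because this is a preview feature, the parameter names and API version may change.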

More info:

https://docs.microsoft.com/en-us/azure/search/semantic-search-overview

https://youtu.be/d_6ZNyV1MvA?t=619

CodePudding user response:

There is a way to avoid the OCR operation: implement the search through REST API calls, using the GET and POST methods. Follow the link for the REST API implementation of the search criteria.
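As a rough illustration of the POST route, the sketch below cleans up messy extracted text and builds the URL and JSON body for a search request without sending it. It assumes the `2020-06-30` REST API version and a searchable field named `content`, which is a hypothetical name, as are the service and index names. The term cap is a precaution on the assumption that very long query text may run into service limits.

```python
import json
import re


def clean_extracted_text(text, max_terms=200):
    """Collapse messy extracted text into a short list of unique terms."""
    words = re.findall(r"\w+", text.lower())
    unique = []
    for word in words:
        # Skip very short tokens and repeats, keeping first-seen order.
        if len(word) > 2 and word not in unique:
            unique.append(word)
    return " ".join(unique[:max_terms])


def build_search_request(service, index, text, field="content", top=10):
    """Return (url, json_body) for a POST search using the cleaned text."""
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/docs/search?api-version=2020-06-30")
    body = {
        "search": clean_extracted_text(text),
        "searchMode": "any",    # match documents containing any of the terms
        "queryType": "simple",
        "searchFields": field,
        "top": top,
    }
    return url, json.dumps(body)


# Hypothetical service and index names.
url, body = build_search_request("my-service", "pdf-index",
                                 "Contoso Ltd\nContoso\n\nJohn  Smith")
print(url)
print(body)
```

With `searchMode` set to `any`, documents sharing more of the terms should tend to rank higher, which gives a crude similarity ranking rather than true document-to-document comparison.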
