Azure Form Recognizer Not Finding Content with Python on Databricks

Time:05-21

I am running the following Python code on Databricks, with the Azure Form Recognizer libraries installed:

from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential("aaa6123af5b843a38044538d95584c3d")
endpoint = "https://myformrecognizr.cognitiveservices.azure.com/"

form_recognizer_client = FormRecognizerClient(endpoint, credential)

with open("/dbfs/mnt/lake/RAW/export/Picturehouse.pdf", "rb") as fd:
    form = fd.read()

poller = form_recognizer_client.begin_recognize_content(form)
form_pages = poller.result()

for content in form_pages:
    for table in content.tables:
        print("Table found on page {}:".format(table.page_number))
        print("Table location {}:".format(table.bounding_box))
        for cell in table.cells:
            print("Cell text: {}".format(cell.text))
            print("Location: {}".format(cell.bounding_box))
            print("Confidence score: {}\n".format(cell.confidence))

    if content.selection_marks:
        print("Selection marks found on page {}:".format(content.page_number))
        for selection_mark in content.selection_marks:
            print("Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                selection_mark.bounding_box,
                selection_mark.confidence
            ))

The pdf form looks like the following:

[Image: screenshot of the PDF form]

The library recognizes the following:

Cell text: Item
Cell text: Qty
Cell text: Seat Allocation
Cell text: Subtotal
Cell text: Adult
Cell text: 1
Cell text: D-11
Cell text: 14.50

But it doesn't recognize the following text from the pdf:

You can go straight to the screen by showing your e-ticket to an usher. Alternatively, you can collect your tickets at Box Office at least 15 minutes before the advertised start time of the film or event. You need your Booking Reference and/or payment card to help us find your booking. You can print this page by clicking the "Print This Page" link above.

Is that by design? Or am I missing something in my code?

Answer:

Unfortunately, this is by design. Form Recognizer uses pre-trained models that recognize key-value pairs, text, and tables in the document you upload as input. Paragraph text is recognized as well, even when the file contains a large amount of it with table content in the middle or anywhere else on the page.
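If the paragraph text is indeed recognized, the loop in the question would still not show it, because it prints only `tables` and `selection_marks`. The following is a minimal sketch, assuming azure-ai-formrecognizer 3.x, where each page returned by `begin_recognize_content` also carries a `lines` list whose items have a `text` attribute. The helper name `collect_line_text` and the stand-in sample data are hypothetical and exist only so the sketch runs without calling the service.

```python
from types import SimpleNamespace


def collect_line_text(form_pages):
    """Return (page_number, text) pairs for every recognized line of text."""
    results = []
    for page in form_pages:
        # `lines` may be None or empty when a page has no recognized text.
        for line in page.lines or []:
            results.append((page.page_number, line.text))
    return results


# Stand-in objects (an assumption for illustration) that mimic the shape of
# the pages returned by poller.result(); they are not real service output.
sample_pages = [
    SimpleNamespace(
        page_number=1,
        lines=[
            SimpleNamespace(text="You can go straight to the screen"),
            SimpleNamespace(text="Item"),
        ],
    ),
    SimpleNamespace(page_number=2, lines=None),
]

for page_number, text in collect_line_text(sample_pages):
    print("Page {}: {}".format(page_number, text))
```

With the real client, passing the `form_pages` value from the question's code to `collect_line_text` in place of `sample_pages` should list the paragraph text alongside the table cell text, if the service returned it.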

For more details, see these links:

https://www.drware.com/extract-data-from-pdfs-using-form-recognizer-with-code-or-without/

https://www.youtube.com/watch?v=iBQO4QdUp6A&t=10s

https://github.com/tomweinandy/form_recognizer_demo
