Azure Databricks and Form Recognizer - Invalid Image or password protected

Time:12-16

I'm trying to automate the Azure Form Recognizer process using Databricks. I put my PDF or JPG files in a blob container and run code in Databricks that sends the files to Form Recognizer, performs the data recognition, and writes the results to a new CSV file in the blob container.

Here is the code:

1. Install packages to cloud

%pip install azure.storage.blob
%pip install azure.ai.formrecognizer


2. Connect to Azure Storage Container

from azure.storage.blob import ContainerClient

container_url = "https://nameofmystorageaccount.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)


3. Enable Cognitive Services


import requests
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://nameofmyendpoint.cognitiveservices.azure.com/"
key = "nameofmykey"

form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))



4. Send files to Cognitive Services

import pandas as pd

field_list = ["InvoiceDate","InvoiceID","Items","VendorName"]
df = pd.DataFrame(columns=field_list)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")

    for idx, invoice in enumerate(invoices):
        single_df = pd.DataFrame(columns=field_list)
        
        for field in field_list:
            entry = invoice.fields.get(field)
            
            if entry:
                single_df[field] = [entry.value]
                
        # Add the file name and append once per invoice, after all fields are collected
        single_df['FileName'] = blob.name
        df = pd.concat([df, single_df])
            
df = df.reset_index(drop=True)
df

The first three steps run without problems, but I get the following error message on the fourth step: (InvalidImage) The input data is not a valid image or password protected. The error is raised by the line poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)

The PDFs in my blob have no password and run correctly in the Form recognizer when I do the process manually. I only use two small PDFs for my test, and I tried with different PDFs.

I use the free subscription of Azure. My Databricks cluster has an unrestricted policy and runs in single node cluster mode. The runtime version is 9.1 LTS (Apache Spark 3.1.2, Scala 2.12). The public access level of my container is set to "Container".

Is there any configuration that I need to change in order to run the code without error?

Thank you and have a nice day

User response:

In my opinion, the URL is not publicly accessible, so the service cannot download the file correctly.

The best way is to pass the whole document and use a different method:

# endpoint and key are the values defined in step 3 of the question
form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))

with open("<path to your invoice>", "rb") as fd:
    invoice = fd.read()

poller = form_recognizer_client.begin_recognize_invoices(invoice)
result = poller.result()
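
In the Databricks setup from the question the files live in a blob container rather than on a local disk, so the same idea could be sketched as follows. This is a minimal sketch, not a tested fix: the helper name recognize_invoice_from_blob is hypothetical, and it assumes that container is the ContainerClient and form_recognizer_client is the FormRecognizerClient created in the question.

```python
def recognize_invoice_from_blob(container, form_recognizer_client, blob_name):
    """Download one blob's bytes and submit them to the prebuilt invoice model.

    Hypothetical helper: `container` is assumed to be an
    azure.storage.blob.ContainerClient and `form_recognizer_client` an
    azure.ai.formrecognizer.FormRecognizerClient, as in the question.
    """
    # download_blob returns a downloader object; readall() gives the raw bytes
    invoice_bytes = container.download_blob(blob_name).readall()

    # Pass the document content itself instead of a URL the service must fetch
    poller = form_recognizer_client.begin_recognize_invoices(invoice_bytes)
    return poller.result()
```

Inside the loop from step 4, the call would then look like invoices = recognize_invoice_from_blob(container, form_recognizer_client, blob.name), which avoids depending on the blob URL being publicly reachable.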

User response:

Thank you for your response Hubert.

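It turned out that my blob container held not only the two PDF files but also a saved Form Recognizer custom model. Because the loop sends every blob in the container to Form Recognizer, the model's .fott and .json files were being submitted as invoices, which presumably triggered the error. I deleted the .fott and .json files so that only the PDFs remained in the container, and the code then ran without any problem.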

Thank you
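
As an alternative to deleting the extra files, the loop could skip blobs that are not documents. The snippet below is only a sketch under assumptions: the function names are hypothetical, and the extension list covers just the PDF and JPG files mentioned in the question, so it may need extending for other formats.

```python
from pathlib import PurePosixPath

# Assumption: only the file types mentioned in the question are wanted.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg"}


def is_supported_document(blob_name: str) -> bool:
    """Return True if the blob name ends with one of the supported extensions."""
    return PurePosixPath(blob_name).suffix.lower() in SUPPORTED_EXTENSIONS


def filter_document_blobs(blob_names):
    """Keep only the blob names that look like documents, preserving order."""
    return [name for name in blob_names if is_supported_document(name)]


# Example with made-up blob names similar to those described above
names = ["invoice1.pdf", "scan.JPG", "model.fott", "invoice1.pdf.labels.json"]
print(filter_document_blobs(names))
```

In the loop from step 4, this would amount to adding a check such as if not is_supported_document(blob.name): continue before building blob_url, so that custom model files left in the container are never sent to the invoice model.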
