I'm trying to automate the Azure Form Recognizer process using Databricks. I would put my PDF or JPG files in a blob container and run code in Databricks that sends the files to Form Recognizer, performs the data recognition, and writes the results to a new CSV file in the blob container.
Here is the code:
1. Install packages to cloud
%pip install azure.storage.blob
%pip install azure.ai.formrecognizer
2. Connect to Azure Storage Container
from azure.storage.blob import ContainerClient
container_url = "https://nameofmystorageaccount.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)
3. Enable Cognitive Services
import requests
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential
endpoint = "https://nameofmyendpoint.cognitiveservices.azure.com/"
key = "nameofmykey"
form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))
4. Send files to Cognitive Services
import pandas as pd
field_list = ["InvoiceDate","InvoiceID","Items","VendorName"]
df = pd.DataFrame(columns=field_list)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")
    for idx, invoice in enumerate(invoices):
        single_df = pd.DataFrame(columns=field_list)
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                single_df[field] = [entry.value]
        single_df['FileName'] = blob.name
        df = df.append(single_df)
df = df.reset_index(drop=True)
df
The first three steps run without problems, but I get the following error message on the fourth step: (InvalidImage) The input data is not a valid image or password protected.
This error concerns the line poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
The PDFs in my blob have no password and run correctly in the Form recognizer when I do the process manually. I only use two small PDFs for my test, and I tried with different PDFs.
I use the free subscription of Azure. My Databricks cluster has an unrestricted policy, with single node cluster mode. The runtime version is 9.1 LTS (Apache Spark 3.1.2, Scala 2.12). The public access level of my container is set to "Container".
Is there any configuration that I need to change in order to run the code without error?
Thank you and have a nice day
CodePudding user response:
In my opinion, the URL is not publicly accessible, so the service cannot download the file correctly.
The best way is to pass the whole document and use a different method:
form_recognizer_client = FormRecognizerClient(endpoint, credential)
with open("<path to your invoice>", "rb") as fd:
    invoice = fd.read()
poller = form_recognizer_client.begin_recognize_invoices(invoice)
result = poller.result()
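If the files live in blob storage rather than on local disk, the same idea can probably be applied by downloading each blob's bytes first and passing them to begin_recognize_invoices. The following is only a sketch: the helper name recognize_blob_bytes is made up for illustration, and it assumes that container is the ContainerClient and form_recognizer_client the FormRecognizerClient created in the question.

```python
def recognize_blob_bytes(container, form_recognizer_client, blob_name):
    """Download one blob and send its raw bytes to the prebuilt invoice model.

    Hypothetical helper: `container` is assumed to be an
    azure.storage.blob.ContainerClient and `form_recognizer_client` an
    azure.ai.formrecognizer.FormRecognizerClient, as in the question.
    """
    # download_blob returns a stream downloader; readall() yields the bytes
    data = container.download_blob(blob_name).readall()
    # Submit the bytes directly, so the service never has to fetch a URL
    poller = form_recognizer_client.begin_recognize_invoices(data)
    return poller.result()
```

Inside the loop from the question this would be called as invoices = recognize_blob_bytes(container, form_recognizer_client, blob.name). Note that this only removes the dependency on the URL being reachable; a blob that is not a document will still be rejected by the service.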
CodePudding user response:
Thank you for your response Hubert.
It turned out that my blob container held not only the two PDF files but also a saved Form Recognizer custom model. I deleted the .fott and .json files so that only the PDFs remained in the container. After that, the code ran without any problem.
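For anyone who cannot delete the extra files, skipping them in the loop might work just as well. Here is a minimal sketch, assuming that only PDF and JPG files should be sent; the names SUPPORTED_EXTENSIONS, is_supported_document and select_documents are made up for this example and are not part of any Azure SDK.

```python
import os

# Assumption: only these extensions are documents to analyse;
# adjust the set to the file types you actually upload.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg"}


def is_supported_document(blob_name):
    """Return True if the blob name ends with one of the allowed extensions."""
    extension = os.path.splitext(blob_name)[1].lower()
    return extension in SUPPORTED_EXTENSIONS


def select_documents(blob_names):
    """Keep only the names that look like documents, preserving order."""
    return [name for name in blob_names if is_supported_document(name)]
```

In the loop over container.list_blobs(), a check such as "if not is_supported_document(blob.name): continue" placed before the Form Recognizer call would leave the .fott and .json files untouched in the container.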
Thank you