I would like to parse form fields, for example checkboxes, from signed PDFs. I have already tried several Python libraries (PyPDF2, pikepdf and even pdfminer), but I only get the text out and not the form fields. I'm already thinking about trying OCR, but that seems very complicated to me and there might be an easier way.
Does anyone have an idea how I can parse the form fields out of a signed PDF?
Thanks in advance!
CodePudding user response:
Disclaimer: I am the author of borb, the library used in this answer.
It's unclear what you want precisely. I see two possibilities:
- You want to extract information from the form fields in the PDF
- Your PDF is signed and then scanned, you want to extract an image of the signature
Either option is possible using borb.
If you want to extract information from the form fields, I would recommend you look at section 4.4 of the examples repository. I'll post the example here for the sake of completeness.
import typing

from borb.pdf import Document
from borb.pdf import PDF


def main():
    # open the document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)
    assert doc is not None

    # get the values of the form fields on the first page
    print("Name: %s" % doc.get_page(0).get_form_field_value("name"))
    print("Firstname: %s" % doc.get_page(0).get_form_field_value("firstname"))
    print("Country: %s" % doc.get_page(0).get_form_field_value("country"))


if __name__ == "__main__":
    main()
This example reads an input PDF, and then fetches the values of the form fields.
You can also do more low-level manipulations: borb represents the PDF as a JSON-like data structure (nested arrays, dictionaries and primitives), so you can get the information relatively easily.
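To give an idea of what such a low-level walk looks like, here is a minimal sketch. It does not use borb's API: the `catalog` dictionary is a hypothetical stand-in built from plain Python dicts and lists, and `walk_fields` is a helper name I made up. Only the key names (`AcroForm`, `Fields`, `Kids`, `T`, `FT`, `V`) come from the PDF specification. In a real file, checkbox values are PDF name objects rather than plain strings.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a parsed PDF catalog (not a real library object).
# T = partial field name, FT = field type, V = value, Kids = child fields.
catalog: Dict[str, Any] = {
    "AcroForm": {
        "Fields": [
            {"T": "name", "FT": "Tx", "V": "Doe"},
            {
                "T": "options",
                "FT": "Btn",  # field type is inherited by the kids below
                "Kids": [
                    {"T": "newsletter", "V": "Yes"},
                    {"T": "terms", "V": "Off"},
                ],
            },
        ]
    }
}


def walk_fields(
    fields: List[Dict[str, Any]],
    prefix: str = "",
    inherited_type: Optional[str] = None,
) -> Iterator[Tuple[str, Optional[str], Any]]:
    """Yield (fully qualified name, field type, value) for each terminal field."""
    for field in fields:
        partial = field.get("T")
        # Fully qualified names join the partial names with a period.
        name = f"{prefix}.{partial}" if prefix and partial else (partial or prefix)
        field_type = field.get("FT", inherited_type)
        # Kids without a T entry are widgets of this same field, not sub-fields.
        sub_fields = [kid for kid in field.get("Kids") or [] if "T" in kid]
        if sub_fields:
            yield from walk_fields(sub_fields, name, field_type)
        else:
            yield name, field_type, field.get("V")


for name, field_type, value in walk_fields(catalog["AcroForm"]["Fields"]):
    print(name, field_type, value)
```

The same recursion should carry over to any library that exposes the document as nested dictionaries and arrays, as long as you adapt the lookups to that library's object types.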
If you want to apply OCR to a PDF, I would recommend yet another example in the examples repository. This time in section 7.2.
import typing
from pathlib import Path

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup


def main():
    # set up everything for OCR (this directory must hold the Tesseract data files)
    tesseract_data_dir: Path = Path("/home/joris/Downloads/tessdata-master/")
    assert tesseract_data_dir.exists()
    l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)

    # read the Document, passing the OCR listener to the loader
    doc: typing.Optional[Document] = None
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])
    assert doc is not None

    # store the Document
    with open("output_002.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


if __name__ == "__main__":
    main()
CodePudding user response:
You can extract (and also manipulate) form fields with PyMuPDF, whether the PDF is signed or not:
import fitz  # the PyMuPDF package

doc = fitz.open("your.pdf")
for page in doc:  # iterate over pages
    print()
    print(f"Form fields on page {page.number}")
    for field in page.widgets():  # iterate over form fields on the page
        print(f"field type '{field.field_type_string}', value '{field.field_value}'")
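If you would rather have the values in a dictionary than printed, something like the following sketch might do. The helper names `collect_form_values` and `form_values_from_file` are my own, not part of PyMuPDF; the only PyMuPDF pieces used are `fitz.open`, `page.widgets()`, and the widget attributes `field_name` and `field_value`. What a ticked checkbox reports as its value depends on the PDF and the PyMuPDF version, so inspect the output for your file before relying on a particular string.

```python
from typing import Any, Dict, Iterable


def collect_form_values(pages: Iterable[Any]) -> Dict[str, Any]:
    """Map each form field name to its value, across all given pages.

    Each page only needs a widgets() method that yields objects with
    field_name and field_value attributes, as PyMuPDF pages do.
    """
    values: Dict[str, Any] = {}
    for page in pages:
        for widget in page.widgets():
            values[widget.field_name] = widget.field_value
    return values


def form_values_from_file(path: str) -> Dict[str, Any]:
    """Open a PDF with PyMuPDF and collect its form field values."""
    import fitz  # the PyMuPDF package, imported lazily so the helper above stays dependency-free

    doc = fitz.open(path)
    try:
        return collect_form_values(doc)
    finally:
        doc.close()
```

Calling `form_values_from_file("your.pdf")` then returns one entry per field name. If several widgets share a name, the last one wins in this sketch.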