How to extract radio button / checkbox information from a PDF file with Python?

I would like to get the radio button / checkbox information from a PDF document. I had a look at pdfplumber and PyPDF2, but was not able to find a solution with these modules.

I can parse the text using this code, but for the radio buttons I only get the label text, with no information about which button (or checkbox) is selected.

import pdfplumber
import os
import sys

if __name__ == '__main__':
  path = os.path.abspath(os.path.dirname(sys.argv[0])) 
  fn = os.path.join(path, "input.pdf")
  pdf = pdfplumber.open(fn)
  page = pdf.pages[0]
  text = page.extract_text()

I have also uploaded an example file here: https://easyupload.io/8y8k2v

Is there any way to get this information from the pdf-file using python?

CodePudding user response：

There are many ways to do that. Maybe you can detect the color of the pixel with the minecart module.

If the pixel is blue, the option was marked; if not, it was not.
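The pixel idea could be sketched roughly as follows. Note that this sketch swaps in pdfplumber's page rendering (`page.to_image(...)`, whose `.original` attribute is a Pillow image) instead of minecart. The helper names `is_blue` and `pixel_is_blue`, the colour thresholds, and the assumption that a selected button is drawn in blue on an uncropped page are all illustrative and would need to be checked against the actual file.

```python
def is_blue(rgb):
    """Return True if an (r, g, b) pixel is clearly dominated by blue.

    The thresholds are assumptions for illustration, not values taken
    from the example PDF.
    """
    r, g, b = rgb[:3]
    return b > 128 and b > r + 50 and b > g + 50


def pixel_is_blue(page, x, top, resolution=72):
    """Render a pdfplumber page and test the pixel at PDF position (x, top).

    x and top are in PDF points measured from the top-left corner of an
    uncropped page (assumption); they are scaled to pixel coordinates.
    """
    scale = resolution / 72
    img = page.to_image(resolution=resolution).original.convert("RGB")
    return is_blue(img.getpixel((int(x * scale), int(top * scale))))
```

A call such as `pixel_is_blue(pdf.pages[0], 60, 120)` would then test one position; the coordinates of each button's centre still have to be found separately, for example from `page.curves`.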

CodePudding user response：

I think I found a solution using pdfplumber (probably not elegant, but I can check whether the radio buttons are selected or not).

Generally:

I read all chars and all curves for all pages
then I sort all elements by x and y (to get the chars and curves in the same order as in the PDF)
then I concatenate the chars and also add blanks when the distance between two chars is larger than inside a word
I check the pts information of the curves and get from it whether the radio button is selected or not

The final lines and the yes/no information are stored in a list, line by line, for further processing
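The sorting and concatenation steps can be tried out on plain data without a PDF. This is only a minimal sketch: the function names `reading_order` and `join_chars`, the sample coordinates, and the gap threshold of 20 points are assumptions for illustration. The dict keys mirror those pdfplumber uses for chars (`text`, `x0`, `y0`).

```python
def reading_order(items):
    """Sort items top-to-bottom, then left-to-right.

    y0 is measured from the bottom of the page, so a larger y0
    means the item sits higher on the page.
    """
    return sorted(items, key=lambda it: (-it["y0"], it["x0"]))


def join_chars(chars, gap=20):
    """Concatenate already sorted chars, adding a blank at wide gaps."""
    text = ""
    prev_x = None
    for ch in chars:
        if prev_x is not None and ch["x0"] - prev_x > gap:
            text += " "
        text += ch["text"]
        prev_x = ch["x0"]
    return text


# Shuffled sample: "Yes" on the upper line, "No" on the lower line
sample = [
    {"text": "o", "x0": 56, "y0": 700},
    {"text": "N", "x0": 50, "y0": 700},
    {"text": "Y", "x0": 50, "y0": 720},
    {"text": "s", "x0": 62, "y0": 720},
    {"text": "e", "x0": 56, "y0": 720},
]
ordered = reading_order(sample)
print(join_chars(ordered[:3]), join_chars(ordered[3:]))
```

Sorting on the exact `y0` value only works when all chars of a line share the same baseline; for real documents a small tolerance may be needed.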

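import pdfplumber
import os
import sys

if __name__ == '__main__':
  # "path" is built the same way as in the question's code
  path = os.path.abspath(os.path.dirname(sys.argv[0]))
  fn = os.path.join(path, "input.pdf")
  pdf = pdfplumber.open(fn)
  finalContent = []
  for idx, page in enumerate(pdf.pages, start=1):
    print(f"Reading page {idx}")
    # collect all chars and curves together with their position
    contList = []
    for e in page.chars:
      contList.append(["char", e["text"], e["x0"], e["y0"]])
    for e in page.curves:
      contList.append(["curve", e["pts"], e["x0"], e["y0"]])
    # two stable sorts: first by x, then by y descending,
    # which yields top-to-bottom, left-to-right order
    contList.sort(key=lambda x: x[2])
    contList.sort(key=lambda x: x[3], reverse=True)

    workContent = []
    workText = ""
    workDistCharX = None
    workDistCharY = None
    for e in contList:
      if e[0] == "char":
        # add a separator after a wide horizontal gap or when a new line starts
        if workDistCharX is not None and \
           (e[2] - workDistCharX > 20 or e[3] - workDistCharY < -2):
          workText += " / "
        workText += e[1]
        workDistCharX = e[2]
        workDistCharY = e[3]
        continue
      if e[0] == "curve":
        # flush the text collected before this curve
        if workText != "":
          workContent.append(workText)
          workText = ""

        # check on the first point of the curve; the threshold of 100
        # was found for the example file and may differ for other PDFs
        if e[1][0][0] < 100:
          tmpVal = "SELECT-YES"
        else:
          tmpVal = "SELECT-NO"

        workContent.append(f"CURVE {tmpVal}, None, None")

    # keep text that follows the last curve on the page
    if workText != "":
      workContent.append(workText)
    finalContent.extend(workContent)
    print("\n".join(workContent))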