How to extract radiobutton / checkbox information with python from a pdf-file?

Time:09-10

I would like to get the radio-button / checkbox information from a PDF document. I had a look at pdfplumber and PyPDF2, but was not able to find a solution with these modules.

I can parse the text using this code, but for the radio buttons I only get the text and no information about which button (or checkbox) is selected.

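import os
import sys

import pdfplumber

if __name__ == '__main__':
    # Look for input.pdf next to this script
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    fn = os.path.join(path, "input.pdf")
    with pdfplumber.open(fn) as pdf:
        page = pdf.pages[0]
        # Returns only the text; the radio-button state is not included
        text = page.extract_text()
        print(text)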


I have also uploaded an example file here: https://easyupload.io/8y8k2v

Is there any way to get this information from the PDF file using Python?

CodePudding user response:

There are many ways to do that. Maybe you can detect the color of the pixel with the minecart module.

If the pixel is blue, the button was marked; if not, it was not.
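
As a rough sketch of the color idea, and not a tested solution for the example file: the snippet below uses pdfplumber (the library from the question) instead of minecart, because pdfplumber exposes a `non_stroking_color` value on curve objects. The helper names `is_bluish` and `filled_curve_colors` and the `margin` threshold are made up for illustration, and it assumes the fill color comes back as an RGB triple with components between 0 and 1, which will not hold for every PDF.

```python
def is_bluish(color, margin=0.2):
    """Return True if an RGB triple (components in 0..1) is dominated by blue."""
    # Assumption: anything that is not a 3-component color (grayscale, CMYK,
    # missing fill) is treated as "not blue" rather than guessed at.
    if not isinstance(color, (tuple, list)) or len(color) != 3:
        return False
    r, g, b = color
    return b - max(r, g) > margin


def filled_curve_colors(pdf_path, page_index=0):
    """Yield (x0, y0, fill color) for every curve on one page."""
    import pdfplumber  # imported lazily so is_bluish can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        for curve in pdf.pages[page_index].curves:
            yield curve["x0"], curve["y0"], curve.get("non_stroking_color")
```

A possible use would be to loop over `filled_curve_colors("input.pdf")` and report the positions where `is_bluish(color)` is true; whether blue really marks a selected button depends on how the PDF was produced.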

CodePudding user response:

I think I found a solution using pdfplumber (probably not elegant, but I can check whether the radio buttons are selected or not).

Generally:

  • I read all chars and all curves for all pages

  • then I sort all elements by x and y (to get the chars and curves in the same order as in the PDF)

  • then I concatenate the chars and add a separator when the distance between two chars is larger than within a word

  • I check the pts information of the curves to find out whether the radio button is selected or not

  • the final lines and the yes/no information are stored in a list, line by line, for further processing

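    import os
    import sys

    import pdfplumber

    if __name__ == '__main__':
        # Look for input.pdf next to this script
        path = os.path.abspath(os.path.dirname(sys.argv[0]))
        fn = os.path.join(path, "input.pdf")
        finalContent = []
        with pdfplumber.open(fn) as pdf:
            for idx, page in enumerate(pdf.pages, start=1):
                print(f"Reading page {idx}")
                # Collect chars and curves as [kind, payload, x0, y0]
                contList = []
                for e in page.chars:
                    contList.append(["char", e["text"], e["x0"], e["y0"]])
                for e in page.curves:
                    contList.append(["curve", e["pts"], e["x0"], e["y0"]])
                # Two stable sorts: top to bottom (y descending), then left to right
                contList.sort(key=lambda x: x[2])
                contList.sort(key=lambda x: x[3], reverse=True)

                workContent = []
                workText = ""
                workDistCharX = None
                workDistCharY = None
                for e in contList:
                    if e[0] == "char":
                        # Large horizontal gap or a drop to a lower line: add a separator
                        if workDistCharX is not None and \
                           (e[2] - workDistCharX > 20 or e[3] - workDistCharY < -2):
                            workText += " / "
                        workText += e[1]
                        workDistCharX = e[2]
                        workDistCharY = e[3]
                        continue
                    if e[0] == "curve":
                        # Flush the text collected before this curve
                        if workText != "":
                            workContent.append(workText)
                            workText = ""

                        # x of the curve's first point below 100 counts as selected
                        # (threshold taken as-is; likely specific to this file)
                        if e[1][0][0] < 100:
                            tmpVal = "SELECT-YES"
                        else:
                            tmpVal = "SELECT-NO"

                        workContent.append(f"CURVE {tmpVal}, None, None")

                # Keep any text that follows the last curve on the page
                if workText != "":
                    workContent.append(workText)
                finalContent.extend(workContent)
                workContent = "\n".join(workContent)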
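
The sort-and-concatenate step can also be tried in isolation, without a PDF. The following is a minimal sketch on made-up coordinates; the function name `join_chars` and the sample data are hypothetical, while the thresholds 20 and 2 and the " / " separator are the ones used in the code above.

```python
def join_chars(chars, gap=20, line_drop=2, sep=" / "):
    """Join (text, x0, y0) tuples into one string in reading order."""
    # PDF y coordinates grow upward, so sort by y descending, then x ascending.
    ordered = sorted(chars, key=lambda c: (-c[2], c[1]))
    out = ""
    prev = None
    for text, x0, y0 in ordered:
        if prev is not None:
            prev_x, prev_y = prev
            # Wide horizontal gap or a drop to a lower line starts a new chunk.
            if x0 - prev_x > gap or y0 - prev_y < -line_drop:
                out += sep
        out += text
        prev = (x0, y0)
    return out


# Made-up sample: "A" and "b" are close together, "C" is far to the right,
# "d" sits on the next line.
sample = [("C", 80, 700), ("d", 10, 680), ("A", 10, 700), ("b", 16, 700)]
print(join_chars(sample))  # prints "Ab / C / d"
```

Sorting on the pair `(-y0, x0)` gives the same order as the two consecutive stable sorts in the answer's code. Chars whose baselines differ slightly within one visual line would still be ordered by y first, so real documents may need some rounding of y0 before sorting.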
    