I would like to get the radio-button / checkbox information from a PDF document. I had a look at pdfplumber and PyPDF2 but was not able to find a solution with these modules.
I can parse the text using this code, but for the radio buttons I get only the text and no information about which button (or checkbox) is selected.
import pdfplumber
import os
import sys

if __name__ == '__main__':
    path = os.path.abspath(os.path.dirname(sys.argv[0]))
    fn = os.path.join(path, "input.pdf")
    pdf = pdfplumber.open(fn)
    page = pdf.pages[0]
    text = page.extract_text()
I have also uploaded an example file here: https://easyupload.io/8y8k2v
Is there any way to get this information from the pdf-file using python?
CodePudding user response:
There are many ways to do that. Maybe you can detect the color of the pixel with the minecart module:
if the pixel is blue, the button was marked; if not, it was not.
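A rough sketch of that idea, not tested against the sample file. It assumes the selected button is drawn as a shape with a blue fill, and that minecart exposes shapes and their fill color as described in its README (`Document`, `get_page`, `page.shapes`, `fill.color.as_rgb()`). The helper names `is_blue` and `find_blue_shapes` and the `margin` threshold are made up for illustration.

```python
def is_blue(rgb, margin=0.2):
    """Return True when the blue channel clearly dominates red and green.

    rgb is a tuple of floats in the range 0..1; margin is an assumed
    threshold, not a value taken from any library.
    """
    r, g, b = rgb
    return b - max(r, g) > margin


def find_blue_shapes(pdf_path, page_index=0):
    """Return the shapes on one page whose fill color looks blue."""
    # Third-party module, imported here so is_blue() works without it.
    import minecart

    found = []
    with open(pdf_path, "rb") as fh:
        doc = minecart.Document(fh)
        page = doc.get_page(page_index)
        for shape in page.shapes:
            fill = getattr(shape, "fill", None)  # unfilled shapes have no fill
            if fill is not None and is_blue(fill.color.as_rgb()):
                found.append(shape)
    return found
```

If `find_blue_shapes("input.pdf")` returns something, the position of each shape would still have to be matched to the nearest label text to know which option it belongs to.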
CodePudding user response:
I think I found a solution using pdfplumber (probably not elegant, but I can check whether the radio buttons are selected or not).
Generally:
I read all chars and all curves for all pages
then I sort all elements by x and y (to get the chars and elements in the correct order, as in the PDF)
then I concatenate the chars and add a separator (" / ") when the distance between two chars is larger than within a word
I check the pts information of the curves to find out whether the radio button is selected or not
the final lines and the yes/no information I store in a list, line by line, for further processing
import pdfplumber
import os
import sys

path = os.path.abspath(os.path.dirname(sys.argv[0]))
fn = os.path.join(path, "input.pdf")
pdf = pdfplumber.open(fn)
finalContent = []
for idx, page in enumerate(pdf.pages, start=1):
    print(f"Reading page {idx}")
    # Collect chars and curves as [type, payload, x0, y0]
    contList = []
    for e in page.chars:
        tmpRow = ["char", e["text"], e["x0"], e["y0"]]
        contList.append(tmpRow)
    for e in page.curves:
        tmpRow = ["curve", e["pts"], e["x0"], e["y0"]]
        contList.append(tmpRow)
    # Stable sorts: left to right, then top to bottom (y0 grows upwards)
    contList.sort(key=lambda x: x[2])
    contList.sort(key=lambda x: x[3], reverse=True)
    workContent = []
    workText = ""
    workDistCharX = None
    workDistCharY = None
    for e in contList:
        if e[0] == "char":
            # Large horizontal gap or a drop to a lower line: add a separator
            if workDistCharX is not None and \
                    (e[2] - workDistCharX > 20 or e[3] - workDistCharY < -2):
                workText += " / "
            workText += e[1]
            workDistCharX = e[2]
            workDistCharY = e[3]
            continue
        if e[0] == "curve":
            # Flush the text collected so far before recording the curve
            if workText != "":
                workContent.append(workText)
                workText = ""
            # Heuristic for this sample file: x of the first curve point
            if e[1][0][0] < 100:
                tmpVal = "SELECT-YES"
            else:
                tmpVal = "SELECT-NO"
            workContent.append(f"CURVE {tmpVal}, None, None")
    # Keep any text that follows the last curve on the page
    if workText != "":
        workContent.append(workText)
    finalContent.extend(workContent)
    workContent = "\n".join(workContent)
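To try the gap-based joining without a PDF, the char part of the loop can be pulled into a small function and fed hand-made data. This is only a sketch: `join_chars`, `word_gap` and `line_drop` are names I made up, the dicts just mimic the `text`, `x0` and `y0` keys that pdfplumber puts on each char, and the thresholds 20 and 2 are the ones that happened to fit my file.

```python
def join_chars(chars, word_gap=20, line_drop=2):
    """Join pdfplumber-style char dicts into one string.

    A " / " separator is inserted when the next char starts more than
    word_gap points to the right of the previous one, or sits more than
    line_drop points lower (i.e. on a new line).
    """
    # Top to bottom first (PDF y grows upwards), then left to right.
    ordered = sorted(chars, key=lambda c: (-c["y0"], c["x0"]))
    parts = []
    prev = None
    for c in ordered:
        if prev is not None and (
            c["x0"] - prev["x0"] > word_gap
            or c["y0"] - prev["y0"] < -line_drop
        ):
            parts.append(" / ")
        parts.append(c["text"])
        prev = c
    return "".join(parts)


# Hand-made sample: "Yes" and "No" on one line, "Q" on the next line.
sample = [
    {"text": "Y", "x0": 10, "y0": 700},
    {"text": "e", "x0": 16, "y0": 700},
    {"text": "s", "x0": 22, "y0": 700},
    {"text": "N", "x0": 80, "y0": 700},
    {"text": "o", "x0": 86, "y0": 700},
    {"text": "Q", "x0": 10, "y0": 680},
]
print(join_chars(sample))  # Yes / No / Q
```

With a real page you would pass `page.chars` instead of `sample`. The thresholds probably need tuning per document, since they depend on font size and layout.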