Home > Mobile >  Python extract PDF data conditionally
Python extract PDF data conditionally

Time:09-23

I have few chapters in pdf file showed in the image below. I want to extract the data (red line) after the bold sentences
My current code able to loop through all the pages and extract all text. But how to extract those after bold sentences?

enter image description here enter image description here

Sample code:

import PyPDF2

filename = "test.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""

while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count  = 1
    text  = pageObj.extractText()

print(text)

CodePudding user response:

Use pdfplumber with char properties.

import pdfplumber

with pdfplumber.open("test.pdf") as pdf: 

    for text in pdf.pages:
        normal_text = text.filter(lambda obj: obj["object_type"] == "char" and not "Bold" in obj["fontname"])

        print(normal_text.extract_text())

Note : fontname is the name of the character's font face (e.g, Bold).

  • Related