I have few chapters in pdf file showed in the image below. I want to extract the data (red line) after the bold sentences
My current code able to loop through all the pages and extract all text. But how to extract those after bold sentences?
Sample code:
import PyPDF2
filename = "test.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
pageObj = pdfReader.getPage(count)
count = 1
text = pageObj.extractText()
print(text)
CodePudding user response:
Use pdfplumber
with char
properties.
import pdfplumber
with pdfplumber.open("test.pdf") as pdf:
for text in pdf.pages:
normal_text = text.filter(lambda obj: obj["object_type"] == "char" and not "Bold" in obj["fontname"])
print(normal_text.extract_text())
Note : fontname
is the name of the character's font face (e.g, Bold).