I have a simple script like this one:
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open("my_pdf", 'rb')
rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)

for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), repr(text)))
How can I restrict pdfminer to reading only the text inside the page's cropbox or mediabox?
CodePudding user response:
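I'm not sure I understood correctly, but if you want to print only the text contained in a given area, you can compare the coordinates returned by bbox against your ROI (region of interest) and print a text box only when it lies inside it.
For a given crop area (x0, y0, x1, y1), where (x0, y0) is the bottom-left corner and (x1, y1) the top-right corner: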
for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            if (
                x0 < lobj.bbox[0]
                and x1 > lobj.bbox[2]
                and y0 < lobj.bbox[1]
                and y1 > lobj.bbox[3]
            ):
                x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
                print('At %r is text: %s' % ((x, y), repr(text)))
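If it helps, the comparison above can be pulled into small helpers so the same loop works for any box. This is only a sketch: bbox_inside, bbox_overlaps and shift_to_layout are names I made up, not part of pdfminer. I'm also assuming that each PDFPage exposes page.cropbox and page.mediabox as (x0, y0, x1, y1) sequences, as I believe pdfminer.six does, and that layout coordinates are measured from the MediaBox origin, so a crop box may need shifting when the MediaBox does not start at (0, 0). Page rotation is ignored here.

```python
def bbox_inside(inner, outer):
    """Return True if `inner` lies entirely within `outer`.

    Both boxes are (x0, y0, x1, y1) tuples with the origin at the bottom-left.
    Unlike the strict comparison above, touching the edge counts as inside.
    """
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def bbox_overlaps(a, b):
    """Return True if the two boxes share any area (a looser test)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def shift_to_layout(box, mediabox):
    """Translate `box` so that the MediaBox origin becomes (0, 0).

    Assumption: layout coordinates are relative to the MediaBox origin.
    """
    mx0, my0 = mediabox[0], mediabox[1]
    x0, y0, x1, y1 = box
    return (x0 - mx0, y0 - my0, x1 - mx0, y1 - my0)


# Hypothetical use inside the loop from the question (not run here):
#
# for page in pages:
#     interpreter.process_page(page)
#     crop = shift_to_layout(page.cropbox, page.mediabox)  # assumed attributes
#     for lobj in device.get_result():
#         if isinstance(lobj, LTTextBox) and bbox_inside(lobj.bbox, crop):
#             print(lobj.bbox, repr(lobj.get_text()))
```

Use bbox_overlaps instead of bbox_inside if you also want text boxes that are only partly inside the area.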