How can I limit pdfminer to read data in the cropbox or mediabox

Time:07-26

Suppose I have some simple code like this:

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator


# Open the PDF and set up the layout-analysis pipeline
fp = open("my_pdf", 'rb')
rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)


for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            # bbox is (x0, y0, x1, y1); report the top-left corner of each text box
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), repr(text)))

How can I limit pdfminer to read the information in the cropbox or mediabox?

CodePudding user response:

I'm not sure I understood correctly, but if you want to print only the text contained in a given area, you can compare the coordinates returned by bbox against your ROI (region of interest) and print conditionally. Each bbox is a tuple (x0, y0, x1, y1): left, bottom, right, top.

For a given crop area (x0, y0, x1, y1):

for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            # Keep only text boxes that lie strictly inside the crop area
            if (
                x0 < lobj.bbox[0]
                and x1 > lobj.bbox[2]
                and y0 < lobj.bbox[1]
                and y1 > lobj.bbox[3]
            ):
                x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
                print('At %r is text: %s' % ((x, y), repr(text)))
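
To tie this back to the cropbox or mediabox: in pdfminer.six, each PDFPage object exposes page.cropbox and page.mediabox as four-number rectangles, so one of them could serve as the crop area instead of hand-picked values. The sketch below isolates the containment test so it can be tried without a PDF. The helper names (normalize, is_inside, boxes_in_area) and the sample numbers are hypothetical, not part of pdfminer. It also assumes the page is unrotated and its mediabox starts at (0, 0); otherwise the layout coordinates may need to be shifted before comparing. Unlike the loop above, it counts boxes that touch the edge as inside.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1) with the origin at the bottom-left, as in pdfminer bboxes
BBox = Tuple[float, float, float, float]


def normalize(rect: Iterable[float]) -> BBox:
    """Order the corners so that x0 <= x1 and y0 <= y1.

    PDF rectangles may list any two diagonal corners, so a box read
    from a page is normalized here before it is used for comparisons.
    """
    xa, ya, xb, yb = rect
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))


def is_inside(bbox: BBox, area: BBox) -> bool:
    """Return True if bbox lies entirely within area (edges included)."""
    bx0, by0, bx1, by1 = bbox
    ax0, ay0, ax1, ay1 = area
    return ax0 <= bx0 and ay0 <= by0 and bx1 <= ax1 and by1 <= ay1


def boxes_in_area(bboxes: Iterable[BBox], area: BBox) -> List[BBox]:
    """Keep only the bounding boxes that fall inside area."""
    return [b for b in bboxes if is_inside(b, area)]


# Hypothetical values: crop_area stands in for normalize(page.cropbox),
# and candidates stand in for the lobj.bbox values of LTTextBox objects.
crop_area = normalize([612, 792, 0, 0])
candidates = [
    (72, 700, 300, 720),    # fully inside
    (-20, 100, 50, 120),    # sticks out on the left
    (500, 780, 620, 800),   # sticks out on the top and right
]
print(boxes_in_area(candidates, crop_area))
```

In the loop above, the inner condition would then become something like is_inside(lobj.bbox, normalize(page.cropbox)).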