Extracting text in known bbox from pdf, PDFQuery too slow


I've found the bbox coordinates in the lxml file and managed to extract the data I want with PDFQuery, then write it to a CSV file.

import pandas as pd
import pdfquery
from pathlib import Path

def pdf_scrape(pdf):
    """
    Extract each relevant information individually
    input: pdf to be scraped
    returns: dataframe of scraped data
    """
    # Define coordinates of text to be extracted
    CUSTOMER             = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 563.285, 624.656, 580.888")').text() 
    CUSTOMER_REF         = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 534.939, 443.186, 552.542")').text()
    SALES_ORDER          = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 504.692, 414.352, 522.295")').text()
    ITEM_NUMBER          = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 478.246, 395.129, 495.849")').text()
    KEY                  = '0000' + SALES_ORDER + '-' + '00' + ITEM_NUMBER
    # Combine all relevant information into a single pandas dataframe
    page = pd.DataFrame({
        'KEY'          : KEY,
        'CUSTOMER'     : CUSTOMER,
        'CUSTOMER REF.': CUSTOMER_REF,
        'SALES ORDER'  : SALES_ORDER,
        'ITEM NUMBER'  : ITEM_NUMBER
                       }, index=[0])
    return page

pdf_search = Path("files/").glob("*.pdf")

pdf_files = [str(file.absolute()) for file in pdf_search]

master = list()
for pdf_file in pdf_files: 
    pdf = pdfquery.PDFQuery(pdf_file)
    pdf.load(0)

    # Scrape the first page of each document and collect the result
    page = pdf_scrape(pdf) 
    master.append(page)

master = pd.concat(master, ignore_index=True)
master.to_csv('scraped_PDF_as_csv/scraped_PDF_DataFrame.csv', index=False)

The problem is that I need to read through hundreds of PDFs each day, and this script takes ~13-14 seconds to mine four elements from the first page of only 10 PDFs.

Is there a way to speed up my code? I've looked at this:

                         simple    complicated
    PDFQuery timing (s)   0.123      0.258
    PyMuPDF timing (s)    0.069      0.070

PyMuPDF runs both PDFs in almost the same time, and I think we're seeing PDFQuery taking longer to make those n**2/2 cross-comparisons.

I think you'll be giving up a lot of convenience trying to do this yourself. If your PDFs are consistent you could probably tune PyMuPDF and get it just right, but if there's variation in how they were created it might take longer to get right (if ever, because text in PDFs is deceptively tricky).
