I've found the bbox coordinates in the lxml file and managed to extract the data I need with PDFQuery. I then write the data to a CSV file.
import pandas as pd
import pdfquery
from pathlib import Path


def pdf_scrape(pdf):
    """
    Extract each relevant piece of information individually
    input: pdf to be scraped
    returns: dataframe of scraped data
    """
    # Define coordinates of text to be extracted
    CUSTOMER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 563.285, 624.656, 580.888")').text()
    CUSTOMER_REF = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 534.939, 443.186, 552.542")').text()
    SALES_ORDER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 504.692, 414.352, 522.295")').text()
    ITEM_NUMBER = pdf.pq('LTTextLineHorizontal:overlaps_bbox("356.684, 478.246, 395.129, 495.849")').text()
    # Build a key from the sales order and item number
    KEY = '0000' + SALES_ORDER + '-' + '00' + ITEM_NUMBER
    # Combine all relevant information into a single pandas dataframe
    page = pd.DataFrame({
        'KEY': KEY,
        'CUSTOMER': CUSTOMER,
        'CUSTOMER REF.': CUSTOMER_REF,
        'SALES ORDER': SALES_ORDER,
        'ITEM NUMBER': ITEM_NUMBER
    }, index=[0])
    return page

pdf_search = Path("files/").glob("*.pdf")
pdf_files = [str(file.absolute()) for file in pdf_search]

master = []
for pdf_file in pdf_files:
    pdf = pdfquery.PDFQuery(pdf_file)
    pdf.load(0)  # Load only the first page
    # Scrape the first page and add the resulting dataframe to the list
    page = pdf_scrape(pdf)
    master.append(page)

# Stack the per-file dataframes and write them out as a single CSV
master = pd.concat(master, ignore_index=True)
master.to_csv('scraped_PDF_as_csv/scraped_PDF_DataFrame.csv', index=False)

The problem is that I need to read through hundreds of PDFs each day, and this script takes ~13-14 seconds to mine four elements from the first page of only 10 PDFs.
Is there a way to speed up my code? I've looked at this:
PyMuPDF runs both PDFs in almost the same time, and I think we're seeing PDFQuery taking longer to make those n**2/2 cross-comparisons.
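To get a feel for why pairwise comparisons become expensive, here is a small sketch that counts the unordered pairs among n layout elements. It only illustrates the roughly n**2/2 growth; it is not PDFQuery's actual code, and `pair_count` is a name made up for this example.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n elements: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (10, 100, 1000):
    # Cross-check the closed form against explicit enumeration for small n
    if n <= 100:
        assert pair_count(n) == sum(1 for _ in combinations(range(n), 2))
    print(n, pair_count(n))
# prints:
# 10 45
# 100 4950
# 1000 499500
```

A page with ten times as many text elements therefore needs about a hundred times as many comparisons.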
I think you'll be giving up a lot of convenience if you try to do this yourself. If your PDFs are consistent, you could probably tune PyMuPDF and get it just right, but if there's variation in how they were created, it might take longer to get right (if ever, because text in PDFs is deceptively tricky).
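If the PDFs really are consistent, a minimal sketch of the PyMuPDF route might look like the code below. It makes several assumptions: the four bounding boxes are the ones from the question, the page is unrotated with its origin at (0, 0), and the helper names (`pdfminer_bbox_to_top_left`, `extract_fields`, `FIELDS`) are invented for this example. PDFQuery reports boxes with the origin at the bottom-left of the page, while PyMuPDF measures y from the top, so the boxes are flipped before being used as a clip rectangle. Note also that clipping is not the same selection rule as `overlaps_bbox`, so the boxes may need some tuning.

```python
# Bounding boxes from the question, in PDFQuery (bottom-left origin) coordinates
FIELDS = {
    "CUSTOMER": (356.684, 563.285, 624.656, 580.888),
    "CUSTOMER REF.": (356.684, 534.939, 443.186, 552.542),
    "SALES ORDER": (356.684, 504.692, 414.352, 522.295),
    "ITEM NUMBER": (356.684, 478.246, 395.129, 495.849),
}


def pdfminer_bbox_to_top_left(bbox, page_height):
    """Flip a (x0, y0, x1, y1) box from bottom-left to top-left origin."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def extract_fields(pdf_path, fields=FIELDS):
    """Read the text inside each box on the first page of one PDF."""
    # Imported here so the conversion helper can be used without PyMuPDF
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[0]
        # Assumption: unrotated page whose visible area starts at (0, 0)
        height = page.rect.height
        result = {}
        for name, bbox in fields.items():
            clip = fitz.Rect(*pdfminer_bbox_to_top_left(bbox, height))
            result[name] = page.get_text("text", clip=clip).strip()
    return result
```

Calling `extract_fields(pdf_file)` for each path returns a plain dict, so the rows could be collected in a list and passed to `pd.DataFrame` once at the end rather than building one dataframe per file.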