Home > Enterprise >  How to export captured data from PDF into a DataFrame? [RegEx]
How to export captured data from PDF into a DataFrame? [RegEx]

Time:04-26

import re
import pdfplumber 
import pandas as pd
from collections import namedtuple
path = r"C:\Users\x\Documents\Python Scripts\Files\x.pdf"
Line = namedtuple('Line', 'print_date order_no pos item_no issue qty UM price req_date line_amt')

with pdfp.open(path) as pdf:
    page = pdf.pages[2]
    text = page.extract_text()

new_vend_re = re.compile(r'^\d{1,}\s[A-Z].*')
for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)

This finds and prints something like this:

53 AB839-11 0002 31.00 EA 58.5300 1814.43

There are pages and pages that have to extract similar values like those. That's just an example. Remaining code that does the processing:

line_items = []

with pdfplumber.open(path) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            line = new_vend_re.search(line)
            if line:
                pos = line.group(1)
                item_no = line.group(2)
                issue = line.group(3)
                qty = line.group(4)
                UM = line.group(5)
                price = line.group(6)
                amt = line.group(7)
                line_items.append(Inv(pos, item_no, issue, qty, UM, price, amt))

df = pd.DataFrame(line_items)
df.head()

I've got this code but it doesn't seem to be able to put the extracted data into their respective tuples. My program should basically iterate through the PDF which has multiple pages and exact the values I get from the various extracted items from the regex expression and put them in the tuple but my code doesn't work for some reason.

CodePudding user response:

Your regex is wrong - it starts with "^\d " - meaning start of line followed by digits. The line in your file starts with "(......)" - change the regex:

import re
from collections import namedtuple

Inv = namedtuple('Inv', 'pos, item_no, issue, qty, UM, price, amt')

new_vend_re = re.compile(r'\d \s[A-Z].*')
text = "some\nmore (53 AB839-11 0002 31.00 EA 58.5300 1814.43) things \ntext\n"

line_items = []
for line in text.split('\n'):
    searched = new_vend_re.search(line)
    if searched:
        print(line)
        # get the matched part of the line and remove ( ) from start/end
        m = searched.group(0).strip("()")

        # now its as simple as splitting it into variables
        pos, item_no, issue, qty, UM, price, amt, *crap = m.split()

        # and use a namedtuple that works with that amount of data
        line_items.append(Inv(pos, item_no, issue, qty, UM, price, amt))

if crap:
    print(crap, "that was also captured but not used")

print(*line_items)

import pandas as pd
df = pd.DataFrame(line_items)
print(df.head())

Output:

# line
more (53 AB839-11 0002 31.00 EA 58.5300 1814.43) things 

# crap catchall
['things'] that was also captured but not used

# named tuple
Inv(pos='53', item_no='AB839-11', issue='0002', qty='31.00', UM='EA', price='58.5300', amt='1814.43)')

# df
  pos   item_no issue    qty  UM    price       amt
0  53  AB839-11  0002  31.00  EA  58.5300  1814.43)
  • Related