Extract table from PDF - text in different rows

Time:01-03

I have a bunch of PDFs like this: [image]

When I use:

import tabula

# read_pdf returns a list of DataFrames, one per detected table
df = tabula.read_pdf('045632_2023.pdf', pages='all', lattice=True)

df[0]

This partially solves the problem; see row 1 for the first column where the question is now in one cell only. However, there are still plenty of sentences/cells that are split into several rows.


How would you solve this problem? I need to extract that table and save it as a CSV that looks like the table in the PDF.

CodePudding user response:

It's somewhat understandable that the other solution used the PDFs as images since, as explained in the op1

[Once you have the DataFrame, you can of course use .to_csv to save as CSV.]
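As a minimal sketch of that last step without pandas, assuming the rows come back as a list of dicts keyed 'col_1', 'col_2', and so on (the shape the snippets further down index into). The helper name rows_to_csv and the sample rows are hypothetical, not part of the original answer; with a DataFrame, df.to_csv(path, index=False) does the same job.

```python
import csv
import io

def rows_to_csv(rows, out):
    """Write a list of row dicts to the file-like object `out` as CSV."""
    # Collect column names in first-seen order so rows with missing keys still fit.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in cells for keys that a given row does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical sample rows, not taken from the PDF in the question.
sample_rows = [
    {'col_1': 'Event A', 'col_2': 'Service 1', 'col_3': 'Note, with comma'},
    {'col_1': 'Event A', 'col_2': 'Service 2'},
]
buf = io.StringIO()
rows_to_csv(sample_rows, buf)
csv_text = buf.getvalue()
```

To write to disk instead, pass a file opened with open('table.csv', 'w', newline='', encoding='utf-8'); the csv module expects newline='' so that it controls line endings itself.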


getRows_fromPdfDoc is also capable of extracting a table that spans multiple pages [as long as it has the same header on every page]:

csSamps = [ 'Common Medical Event', 'Services You May Need', '$30 copayment', 
            'Non-Preferred: 40% coinsurance', 'Limitations, Exceptions, & Other']
headTxt = ' '.join(csSamps[:2] + ['What You Will Pay'] + csSamps[-1:])
rList = getRows_fromPdfDoc('045632_2023.pdf', headTxt, csSamps, 2)

But at this point, you might notice that merged cells are left empty except in the first row. If the cells are merged only across rows, it's not difficult to fill them in from the previous rows:

prev_c1 = ''
for ri, r in enumerate(rList):
    if r['col_1']: prev_c1 = r['col_1']
    else: rList[ri]['col_1'] = prev_c1
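The same fill-down idea can be sketched for several columns at once. This assumes the rows are dicts and that an empty string or None marks a cell left blank by a vertical merge; the function name fill_down and the sample data are hypothetical.

```python
def fill_down(rows, columns):
    """Copy the last non-empty value of each listed column into the empty cells below it."""
    last_seen = {col: '' for col in columns}
    for row in rows:
        for col in columns:
            if row.get(col):
                # Non-empty cell: remember it for the rows that follow.
                last_seen[col] = row[col]
            else:
                # Empty or missing cell: reuse the value from the row above.
                row[col] = last_seen[col]
    return rows

# Hypothetical sample rows, not taken from the PDF in the question.
sample = [
    {'col_1': 'Event A', 'col_2': 'Service 1'},
    {'col_1': '', 'col_2': 'Service 2'},
    {'col_1': 'Event B', 'col_2': 'Service 3'},
    {'col_1': None, 'col_2': 'Service 4'},
]
filled = fill_down(sample, ['col_1'])
```

Note that this cannot tell a merged cell from one that is genuinely empty in the PDF, so it is best applied only to columns known to contain row-merged cells.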

However, if cells are merged across columns or split in any way, then the merged columns are split back up, and partially filled rows are added for the split row. [I expect splits across columns will remain entirely undetected and the contents will be merged back into a single cell.]

Also, note that there's no way to extract nested tables with the current functions, although it's definitely not impossible. I might be able to come up with more reliable methods if I can figure out how to detect the background color of the characters and when they cross a cell border.
