Problem
I want to extract a 70-page vocabulary table from a PDF and turn it into a CSV to use in [any vocabulary learning app]. tabula-py with its read_pdf function is a popular solution for extracting such tables, and it detected the columns well without any fine-tuning. However, columns are all it detected well: it had difficulties with multi-line rows, splitting each wrapped line into a separate row.
E.g., in the PDF you will have columns 2 and 3. Since tables on Stack Overflow don't seem to allow multi-line cell content either, I added row numbers in column 1 — please merge the three lines of row 1 in your head.
Row number | German | Latin |
---|---|---|
1 | First word | Translation for first word |
1 | with many lines of content | [phonetic vocabulary thingy] |
1 | and more lines | |
2 | Second word | Translation for second word |
Instead of fine-tuning the read_pdf parameters, are there ways around that?
CodePudding user response:
Possible solution
Instead of experimenting with tabula-py (which is perfectly legitimate, of course), you can export the PDF from Adobe Reader using File->Export a PDF->HTML Web Page.
You then read it using

```python
import pandas as pd

dfs = pd.read_html("file.html", header=0, encoding='utf-8')
```
to get a list of pandas dataframes. You could also use BeautifulSoup4 or similar solutions to extract the tables.
To group tables that share the same column names (e.g., the pages of one vocabulary table) and save each group as a CSV, you can do this:

```python
from collections import defaultdict

unique_columns_to_dataframes = defaultdict(list)

# Join the column names into a single string so we have a hashable dictionary key.
possible_column_variations = [("%%".join(df.columns.values), i) for i, df in enumerate(dfs)]
for k, v in possible_column_variations:
    unique_columns_to_dataframes[k].append(v)

for k, v in unique_columns_to_dataframes.items():
    new_df = pd.concat([dfs[i] for i in v])
    new_df.reset_index(drop=True, inplace=True)
    # Build a unique file name by hashing the column names (a sum of character
    # codes: not collision-free, but unlikely to collide for a handful of tables).
    new_df.to_csv("Df_" + str(sum(ord(c) for c in k)) + ".csv",
                  index=False, sep=";", encoding='utf-8')
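As a quick sanity check of the grouping step, here is the same logic run on three synthetic dataframes (the vocabulary words are invented for illustration); two share a header and end up concatenated, the third forms its own group:

```python
import pandas as pd
from collections import defaultdict

# Three toy "tables": two share a header, one differs.
dfs = [
    pd.DataFrame({"German": ["Haus"], "Latin": ["domus"]}),
    pd.DataFrame({"German": ["Wasser"], "Latin": ["aqua"]}),
    pd.DataFrame({"Page": [1]}),
]

groups = defaultdict(list)
for i, df in enumerate(dfs):
    groups["%%".join(df.columns)].append(i)

merged = {k: pd.concat([dfs[i] for i in v], ignore_index=True)
          for k, v in groups.items()}
print(sorted(merged.keys()))
```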
CodePudding user response:
You may want to use PyMuPDF. Since your table cells are delimited by ruling lines, this is a relatively easy case. I have published a script answering a similar question here.