Problem
I want to extract a 70-page vocabulary table from a PDF and turn it into a CSV to use in [any vocabulary learning app]. tabula-py with its read_pdf function is a popular solution for extracting such tables, and it detected the columns well without any fine-tuning. However, columns are all it detected well: it had difficulties with multi-line rows, splitting each wrapped line into a separate row.
E.g., in the PDF you will have columns 2 and 3. Since tables on Stack Overflow don't seem to allow multi-line cell content either, I added row numbers in column 1 — please merge the three lines of row 1 in your head.
Row number | German | Latin |
---|---|---|
1 | First word | Translation for first word |
1 | with many lines of content | [phonetic vocabulary thingy] |
1 | and more lines | |
2 | Second word | Translation for second word |
Instead of fine-tuning the read_pdf parameters, are there ways around that?
CodePudding user response:
Possible solution
Instead of experimenting with tabula-py (which is perfectly legitimate, of course), you can export the PDF from Adobe Reader using File->Export a PDF->HTML Web Page.
You then read it using

```python
import pandas as pd

dfs = pd.read_html("file.html", header=0, encoding='utf-8')
```
to get a list of pandas dataframes. You could also use BeautifulSoup4 or similar solutions to extract the tables.
To group tables that share the same column names (e.g., the pages of one vocabulary table) and save each group as a CSV, you can do this:

```python
from collections import defaultdict

unique_columns_to_dataframes = defaultdict(list)

# Join the column names into a single string so we have a hashable dictionary key.
possible_column_variations = [("%%".join(df.columns.values), i) for i, df in enumerate(dfs)]
for k, v in possible_column_variations:
    unique_columns_to_dataframes[k].append(v)

for k, v in unique_columns_to_dataframes.items():
    new_df = pd.concat([dfs[i] for i in v])
    new_df.reset_index(drop=True, inplace=True)
    # Build a unique file name by hashing the column names (a sum of character
    # codes: not collision-free, but unlikely to collide for a handful of tables).
    new_df.to_csv("Df_" + str(sum(ord(c) for c in k)) + ".csv",
                  index=False, sep=";", encoding='utf-8')
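As a quick sanity check of the grouping step, here is the same logic run on three synthetic dataframes (the vocabulary words are invented for illustration); two share a header and end up concatenated, the third forms its own group:

```python
import pandas as pd
from collections import defaultdict

# Three toy "tables": two share a header, one differs.
dfs = [
    pd.DataFrame({"German": ["Haus"], "Latin": ["domus"]}),
    pd.DataFrame({"German": ["Wasser"], "Latin": ["aqua"]}),
    pd.DataFrame({"Page": [1]}),
]

groups = defaultdict(list)
for i, df in enumerate(dfs):
    groups["%%".join(df.columns)].append(i)

merged = {k: pd.concat([dfs[i] for i in v], ignore_index=True)
          for k, v in groups.items()}
print(sorted(merged.keys()))
```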
CodePudding user response:
You may want to use PyMuPDF. Since your table cells are delimited by ruling lines, this is a relatively easy case. I have published a script answering a similar question here.