Taking data from a PDF file

I have a table in a PDF document with x, y, and z columns. I want to extract only the x column. Is this possible using Python? If so, how?

Then I want to plot x versus y. How do I do that using the data from the table?

CodePudding user response：

Step 1: Using tabula-py

The functions used in the example are:

read_pdf(): reads the tables from the PDF file at the given path and returns them as a list of pandas DataFrames

tabulate(): formats the data as a plain-text table

Code

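from tabula import read_pdf
from tabulate import tabulate

# read every table in the PDF; read_pdf returns a list of DataFrames
tables = read_pdf("filename.pdf", pages="all")  # path to the PDF file
# print the first table, using the column names as headers
print(tabulate(tables[0], headers="keys"))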

This prints the extracted table to the console.
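To take only the x column, here is a minimal sketch. It assumes the extracted table really has a header named "x"; PDF extraction does not always preserve headers, so check the columns attribute first. The helper name extract_column and the sample values are made up for illustration, and the hand-built DataFrame stands in for the list that read_pdf returns.

```python
import pandas as pd


def extract_column(tables, column):
    """Return the named column from the first table that contains it."""
    for table in tables:
        if column in table.columns:
            return table[column]
    raise KeyError(f"no extracted table has a column named {column!r}")


# Stand-in for the list returned by read_pdf("filename.pdf", pages="all").
# The values are invented purely for illustration.
tables = [pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6], "z": [0, 0, 1]})]

x = extract_column(tables, "x")
print(x.tolist())
```

With real data you would pass the result of read_pdf directly as tables.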

CodePudding user response：

You can try implementing this with Tabula. It has a Python wrapper that can read tables from a PDF and convert them to pandas DataFrames.

Tabula: https://tabula.technology/

Python wrapper: https://pypi.org/project/tabula-py/

CodePudding user response：

Check out tabula-py, a PDF parsing library that extracts tables from PDF documents.

Then parse your table:

import tabula
tables = tabula.read_pdf(<file path>, pages=<page numbers>)

Replace <file path> and <page numbers> with the respective values. pages accepts a single page number, a range such as "1-3", or "all".

It should return a list of pandas DataFrames, from which you can extract the column you need.
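For the plotting part of the question, here is a minimal sketch. It assumes matplotlib is installed and that the extracted table has columns literally named "x" and "y". The helper name plot_x_vs_y, the output file name, and the sample values are made up for illustration. Values extracted from a PDF often arrive as strings, so the sketch converts them to numbers and drops rows that cannot be converted.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display

import matplotlib.pyplot as plt
import pandas as pd


def plot_x_vs_y(table, x_col="x", y_col="y", out_path="x_vs_y.png"):
    """Plot one column against another and save the figure to out_path."""
    # Coerce both columns to numbers; cells that fail become NaN and are dropped.
    data = table[[x_col, y_col]].apply(pd.to_numeric, errors="coerce").dropna()
    fig, ax = plt.subplots()
    ax.plot(data[x_col], data[y_col], marker="o")
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    fig.savefig(out_path)
    plt.close(fig)
    return data


# Stand-in for one DataFrame from the list returned by tabula.read_pdf.
# The values are invented purely for illustration.
table = pd.DataFrame({"x": ["1", "2", "n/a"], "y": ["2", "4", "6"], "z": ["a", "b", "c"]})
cleaned = plot_x_vs_y(table)
print(cleaned)
```

With real data you would pass tables[0] (or whichever list entry holds your table) instead of the hand-built DataFrame.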