Taking data from pdf file

Time:04-15

I have a table in a PDF document with x, y, and z columns. I want to extract only the x column. Is that possible using Python? If so, how?

Then I want to plot x versus y using the data from the table. How do I do that?

CodePudding user response:

Step 1: Use tabula-py

The example uses two functions:

read_pdf(): reads the tables from the PDF file at the given path

tabulate(): formats the data as a plain-text table

Code

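from tabula import read_pdf
from tabulate import tabulate

# read every table in the PDF; read_pdf returns a list of DataFrames
tables = read_pdf("filename.pdf", pages="all")  # path to the PDF file
for df in tables:
    # print each table with its column names as the header row
    print(tabulate(df, headers="keys"))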

This prints the table data extracted from the PDF.
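
To take only the x column, as the question asks, you can index into the list that read_pdf returns and then select the column by name. Here is a minimal sketch, assuming the table you want is the first one found and that its header row is literally x, y, z. The helper names get_column and x_from_pdf are made up for illustration and are not part of tabula-py.

```python
import pandas as pd


def get_column(tables, name, table_index=0):
    """Return one column from one of the DataFrames that read_pdf produced."""
    table = tables[table_index]
    # Headers extracted from a PDF often carry stray whitespace, so strip it.
    table = table.rename(columns=lambda col: str(col).strip())
    if name not in table.columns:
        raise KeyError(f"column {name!r} not found; got {list(table.columns)}")
    return table[name]


def x_from_pdf(pdf_path):
    """Read all tables from the PDF and return the x column of the first one."""
    # Imported here because tabula-py needs a Java runtime to be installed.
    from tabula import read_pdf

    tables = read_pdf(pdf_path, pages="all")  # list of DataFrames
    return get_column(tables, "x")
```

Calling x_from_pdf("filename.pdf") should then give a pandas Series holding just the x values. If the PDF contains several tables, pass a different table_index to get_column.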

CodePudding user response:

You can try tabula for this. It has a Python wrapper that reads tables from a PDF and converts them to pandas DataFrames.

Tabula: https://tabula.technology/

Python wrapper: https://pypi.org/project/tabula-py/

CodePudding user response:

Check out tabula-py, a PDF parsing library that extracts tables from PDF documents.

Then parse your table:

import tabula
table = tabula.read_pdf(<file path>, pages=<page numbers>)

Replace <file path> and <page numbers> with the respective values. The pages argument accepts a single page number, a range such as "1-3", or "all".

It should return a list of pandas DataFrames, from which you can extract the column that you need.
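
For the plotting part of the question, one of those DataFrames can be passed to matplotlib once its columns are numeric. This is only a sketch under a few assumptions: the columns are named x and y, cells that cannot be parsed as numbers should be dropped, and the plot is saved to a file. The function names to_numeric_xy and plot_xy and the default file name are made up for illustration.

```python
import pandas as pd


def to_numeric_xy(table, x_col="x", y_col="y"):
    """Return the two columns as numbers, dropping rows that fail to parse."""
    # PDF cells often arrive as text; unparseable cells become NaN.
    xy = table[[x_col, y_col]].apply(pd.to_numeric, errors="coerce")
    return xy.dropna()


def plot_xy(table, out_file="x_vs_y.png"):
    """Save a line plot with x on the horizontal axis and y on the vertical axis."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, so no display is needed
    import matplotlib.pyplot as plt

    xy = to_numeric_xy(table)
    fig, ax = plt.subplots()
    ax.plot(xy["x"], xy["y"], marker="o")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(out_file)
    plt.close(fig)
    return out_file
```

With the list returned by tabula.read_pdf, something like plot_xy(table[0]) would plot the first table. Swap the two arguments to ax.plot if you want x on the vertical axis instead.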
