I need to extract tabular data from PDFs. Some tables in the PDFs consist of only a single row. I have been trying to extract the data using the Camelot library.
Code for extraction using Camelot:
pip install camelot-py[cv] tabula-py
import camelot
file = 'xyz.pdf'
tables = camelot.read_pdf(file, pages="all")
tables[6].df
The code above does not extract tables that consist of a single row.
For instance, in this PDF: https://www.nirfindia.org/nirfpdfcdn/2022/pdf/Engineering/IR-E-U-0306.pdf, Camelot does not detect the last table (under the heading Faculty Details), which consists of only one row.
Can someone suggest a workaround?
CodePudding user response:
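According to the Camelot docs, the line_scale parameter (default: 15) of the lattice flavor
controls how small a ruling line can be and still be detected: the larger the value,
the smaller the lines that are picked up. Setting it very high can cause text to be detected as lines.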
In your case, this command works fine:
tables = camelot.read_pdf(file, pages="all", line_scale=80)
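If 80 does not suit another PDF, one way to pick a value is to compare how many tables Camelot finds at a few settings. The sketch below is only an illustration: the helper name count_tables_by_line_scale, its reader argument, and the candidate values (15, 40, 80) are assumptions made for this example, not part of Camelot. It relies only on camelot.read_pdf with the pages, flavor, and line_scale arguments.

```python
def count_tables_by_line_scale(path, scales=(15, 40, 80), pages="all", reader=None):
    """Return {line_scale: number of tables detected} for each value in scales.

    Hypothetical helper for illustration; `reader` defaults to camelot.read_pdf.
    """
    if reader is None:
        # Imported lazily so the helper can be exercised without Camelot installed.
        import camelot
        reader = camelot.read_pdf

    counts = {}
    for scale in scales:
        # line_scale applies to the lattice flavor, which is Camelot's default.
        tables = reader(path, pages=pages, flavor="lattice", line_scale=scale)
        counts[scale] = len(tables)
    return counts
```

Calling count_tables_by_line_scale("xyz.pdf") returns a dict keyed by line_scale. A count that rises at a higher value suggests that tables with short ruling lines are now being detected, but it is worth inspecting the extra tables, since very large values can turn text into spurious lines.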