I'm using the Camelot Python library to read all the tables on one page of a PDF document.
I'm trying to read all the tables on page 10 of this PDF.
I tried debugging by plotting the page, and I noticed that the result changes with the flavor:
This is with flavor lattice
This is with flavor stream
The problem is that with flavor='lattice' the tables are not read properly; here is an example.
With flavor='stream' the data is read properly, but only for one table. The output looks something like this.
I tried using table_area/table_regions to detect the two tables with flavor='stream', but it didn't work. The code is pasted below.
Code with lattice:
import camelot
file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='lattice',edge_tool=1500)
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')
print(tables[1].df)
Code with stream, without table_area/table_regions:
import camelot
file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream', edge_tool=1500)
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')
Code with stream, with table_area:
import camelot
file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_area=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')
Code with stream, with table_regions:
import camelot
file = "2022/Auto-trend0122.pdf"
tables = camelot.read_pdf(file,pages='10',flavor='stream',edge_tool=1500,table_regions=['10,450,550,50','10,750,550,450'])
print("Total tables extracted:", tables.n)
print(tables[0].df)
camelot.plot(tables[0],filename="try_plot.png", kind='contour')
The output is the same with table_regions, with table_area, and with neither.
Answer:
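The problem is that you are passing table_area instead of the correct parameter name, table_areas (see the Camelot documentation). Because of the misspelling, the areas never reach the parser, which is why the output did not change.
The following call works: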
tables = camelot.read_pdf(file,pages='10', flavor='stream', edge_tool=1500, table_areas=['10,450,550,50','10,750,550,450'])
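To make the coordinate strings less error-prone, here is a minimal sketch that builds them from numbers. It assumes the documented convention that each string is x1,y1,x2,y2 with (x1, y1) the left-top corner and (x2, y2) the right-bottom corner in PDF coordinates, where the origin is at the bottom-left of the page. The helper names area_string and read_tables_in_areas are hypothetical, not part of Camelot. The sketch also leaves out edge_tool=1500: as far as I know the Stream option is spelled edge_tol, so that name may be worth checking in the docs as well.

```python
def area_string(left, top, right, bottom):
    """Format one box as 'x1,y1,x2,y2' (left-top corner, then right-bottom)."""
    # PDF coordinates start at the bottom-left of the page,
    # so the top edge of a box has the larger y value.
    if not (left < right and top > bottom):
        raise ValueError("expected left < right and top > bottom")
    return f"{left},{top},{right},{bottom}"


def read_tables_in_areas(pdf_path, page, boxes):
    """Run Stream extraction restricted to the given boxes (hypothetical helper)."""
    import camelot  # imported lazily so area_string can be used without Camelot

    areas = [area_string(*box) for box in boxes]
    return camelot.read_pdf(
        pdf_path, pages=str(page), flavor="stream", table_areas=areas
    )


# The two boxes from the question, as (left, top, right, bottom) tuples.
QUESTION_BOXES = [(10, 450, 550, 50), (10, 750, 550, 450)]
print([area_string(*box) for box in QUESTION_BOXES])
# prints ['10,450,550,50', '10,750,550,450']
```

With the file from the question, read_tables_in_areas("2022/Auto-trend0122.pdf", 10, QUESTION_BOXES) should issue the same table_areas request as the command above; I have not run it against that PDF.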
Difference between table_areas and table_regions
table_areas should be used when you know the exact position of the table. Conversely, table_regions makes the detection engine look for tables only in those generic page regions.
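The difference can be sketched as a small switch between the two keywords. This is only an illustration: stream_kwargs and extract are hypothetical helper names, and the boxes are the ones from the question rather than values checked against the PDF.

```python
def stream_kwargs(boxes, exact):
    """Build read_pdf keyword arguments for exact areas or for search regions."""
    # exact=True  -> table_areas: each box is taken as the table itself.
    # exact=False -> table_regions: tables are searched for inside each box.
    key = "table_areas" if exact else "table_regions"
    return {"flavor": "stream", key: list(boxes)}


def extract(pdf_path, page, boxes, exact):
    """Read one page and print the shape of every table found (hypothetical helper)."""
    import camelot  # imported lazily so stream_kwargs can be used without Camelot

    tables = camelot.read_pdf(pdf_path, pages=str(page), **stream_kwargs(boxes, exact))
    for i in range(tables.n):
        print(i, tables[i].df.shape)
    return tables


boxes = ["10,450,550,50", "10,750,550,450"]
print(stream_kwargs(boxes, exact=True))
print(stream_kwargs(boxes, exact=False))
```

Printing the shape of every table, rather than only tables[0], makes it easier to see whether both tables on the page were actually picked up.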