How to detect colored blocks in a PDF file with python (pdfminer, minecart, tabula...)-CodePudding

I am trying to extract quite a few tables from a PDF file. These tables are sort of conveniently "highlighted" with different colors, which makes it easy for eyes to catch (see the example screenshot).

I think it would be good to detect the position/coordinates of those colored blocks, and use the coordinates to extract tables.

I have figured out the table extraction part (using tabula-py). So it is the first step stopping me. From what I gathered minecart is the best tool for color and shapes in PDF files, except full scale imaging processing with OpenCV. But I have no luck with detecting colored box/block coordinates.

Would appreciate any help!!

example page1

CodePudding user response：

I think I got a solution:

import minecart

pdffile = open(fn, 'rb')
doc = minecart.Document(pdffile)
page = doc.get_page(page_num) # page_num is 0-based

for shape in page.shapes.iter_in_bbox((0, 0, 612, 792 )):
    if shape.fill: 
        shape_bbox = shape.get_bbox()
        shape_color = shape.fill.color.as_rgb()
        print(shape_bbox, shape_color)

I would then need to filter the color or the shape size...

My earlier failure was due to having used a wrong page number :(

CodePudding user response：

PyMuPDF lets you extract so-called "line art": the vector drawings on a page. This is a list of dictionaries of "paths" (as PDF calls interconnected drawings) from which you can sub-select ones of interest for you. E.g. the following identifies drawings that represent filled rectangles, not too small:

page = doc[0]  # load some page (here page 0)
paths = page.get_drawings()  # extract all vector graphics
filled_rects = [] # filled rectangles without border land here
for path in paths:
    if path["type"] != "f"  # only consider paths with a fill color
        continue
    rect = path["rect"]
    if rect.width < 20 or rect.height < 20:  # only consider sizable rects
        continue
    filled_rects.append(rect)  # hopefully an area coloring a table
# make a visible border around the hits to see success:
for rect in filled_rects:
    page.draw_rect(rect, color=fitz.pdfcolor["red"])
doc.save("debug.pdf")