Counting coloured pages in a PDF

Time:12-06

I have adapted this code from another StackOverflow post. It converts each PDF page to an image and checks the hue/saturation values for colour. My only issue is that it is very slow: it takes almost a minute for 25 pages. Does anyone have any ideas on how I can make it more efficient?

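from pdf2image import convert_from_path
import numpy as np

def main():
    # Render every page at 500 DPI (poppler_path points at the local Poppler install).
    images = convert_from_path("example1.pdf", 500, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
    sw = 0     # black-and-white pages
    color = 0  # coloured pages

    for image in images:
        img = np.array(image.convert('HSV'))
        # Sum each channel over all pixels: [hue, saturation, value].
        hsv_sum = img.sum(0).sum(0)
        # Zero total hue and saturation means the page has no colour.
        if hsv_sum[0] == 0 and hsv_sum[1] == 0:
            sw += 1
        else:
            color += 1
    print(color)
    print(sw)

main()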

CodePudding user response:

Try using this:

import PyPDF2

pdf_file = open('nama_file.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

colored_page_count = 0

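for page in pdf_reader.pages:
  # Note: colour spaces are usually listed under the page's /Resources
  # dictionary, so this page-level lookup may return None for many PDFs.
  if page.get("/ColorSpace") == "/DeviceRGB":
    colored_page_count += 1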

print(colored_page_count)

pdf_file.close()

CodePudding user response:

Disclaimer: I am the author of borb, the library used in this answer.

Depending on what exactly is colored in the page, you could use borb to get this done.

borb has the concept of EventListener, which gets notified of rendering instructions (as they are coming out of the parser).

This should be as fast as simply reading the PDF.
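To illustrate the idea of looking at rendering instructions instead of rasterised pixels, here is a minimal sketch in plain Python. It is not borb's API. The function name `stream_sets_colour` is hypothetical, and it assumes you already have a page's decoded (uncompressed) content stream as bytes. It only looks at the direct colour operators `rg`/`RG` (RGB) and `k`/`K` (CMYK), so it will miss colour that comes from images, patterns, named colour spaces or nested form XObjects, and the whitespace tokenising is naive (text inside string literals is not skipped).

```python
# Operators that set a device colour directly in a PDF content stream.
RGB_OPS = {"rg", "RG"}   # operands: r g b
CMYK_OPS = {"k", "K"}    # operands: c m y k


def stream_sets_colour(content: bytes) -> bool:
    """Return True if the stream sets a non-grey RGB or CMYK colour.

    Hypothetical helper: `content` is assumed to be an already decoded
    content stream; how it is extracted from the PDF is not shown here.
    """
    operands = []
    for token in content.decode("latin-1").split():
        try:
            # Numbers are operands; collect them until an operator arrives.
            operands.append(float(token))
            continue
        except ValueError:
            pass
        if token in RGB_OPS and len(operands) >= 3:
            r, g, b = operands[-3:]
            # Equal components are a shade of grey, not a colour.
            if not (r == g == b):
                return True
        elif token in CMYK_OPS and len(operands) >= 4:
            c, m, y, _black = operands[-4:]
            # Any cyan, magenta or yellow ink means colour.
            if c or m or y:
                return True
        # Any operator consumes the operands collected so far.
        operands = []
    return False


if __name__ == "__main__":
    print(stream_sets_colour(b"1 0 0 rg 10 10 100 100 re f"))
    print(stream_sets_colour(b"0 0 0 rg 10 10 100 100 re f"))
```

Because this only scans text and never renders the page, the cost is roughly that of reading the stream, which is the reason this kind of approach can be much faster than converting pages to images at 500 DPI.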
