I have adapted this code from another StackOverflow post. It converts each PDF page to an image and checks the hue/saturation values for colour. My only issue is that it is very slow: it takes almost a minute for 25 pages. Does anyone have any ideas on how I can make it more efficient?
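from pdf2image import convert_from_path
import numpy as np


def main():
    # Render every page at 500 DPI (poppler does the rasterising).
    images = convert_from_path("example1.pdf", 500, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
    sw = 0     # set to 1 if at least one page is black and white
    color = 0  # set to 1 if at least one page contains colour
    for image in images:
        img = np.array(image.convert('HSV'))
        # Sum the H, S and V channels over all pixels of the page.
        hsv_sum = img.sum(0).sum(0)
        if hsv_sum[0] == 0 and hsv_sum[1] == 0:
            sw = 1
        else:
            color = 1
    print(color)
    print(sw)


if __name__ == "__main__":
    main()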
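To make the per-page test concrete, here is a minimal sketch of the same hue/saturation idea on plain pixel tuples, using only the standard library's colorsys module. The helper name page_has_colour is hypothetical and not part of pdf2image or Pillow. It assumes the pixels arrive as (r, g, b) tuples in the 0-255 range, and it stops at the first coloured pixel instead of summing the whole page.

```python
import colorsys


def page_has_colour(pixels):
    """Return True as soon as one (r, g, b) pixel has non-zero saturation."""
    for r, g, b in pixels:
        # Grey pixels (r == g == b) come back with hue 0 and saturation 0.
        _, saturation, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if saturation > 0:
            return True  # stop early: one coloured pixel settles the page
    return False


print(page_has_colour([(0, 0, 0), (128, 128, 128), (255, 255, 255)]))  # False
print(page_has_colour([(10, 10, 10), (200, 30, 30)]))                  # True
```

This only illustrates the logic of the check. On full-size page renders a vectorised NumPy comparison would likely be much faster than a Python loop.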
CodePudding user response:
Try this:
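from PyPDF2 import PdfReader  # PdfFileReader is deprecated in newer PyPDF2 releases

colored_page_count = 0
with open('nama_file.pdf', 'rb') as pdf_file:
    pdf_reader = PdfReader(pdf_file)
    for page in pdf_reader.pages:
        # Heuristic: only matches pages that declare /ColorSpace directly
        # on the page dictionary as /DeviceRGB.
        if page.get("/ColorSpace") == "/DeviceRGB":
            colored_page_count += 1  # count every matching page
print(colored_page_count)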
CodePudding user response:
Disclaimer: I am the author of borb, the library used in this answer.
Depending on what exactly is colored in the page, you could use borb to get this done.
borb has the concept of EventListener, which gets notified of rendering instructions (as they are coming out of the parser).
This should be as fast as simply reading the PDF.
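The general idea of inspecting rendering instructions instead of rasterising can be sketched without any library. The snippet below is not borb's API. It is a hypothetical, simplified scanner that assumes you already have a page's decoded content stream as text. It looks only at the PDF colour operators rg/RG (RGB) and k/K (CMYK), splits naively on whitespace, and ignores embedded images and the sc/scn operators.

```python
# Colour-setting operators from the PDF specification and the number of
# numeric operands each one takes (rg/RG: RGB, k/K: CMYK).
COLOUR_OPERATORS = {"rg": 3, "RG": 3, "k": 4, "K": 4}


def stream_uses_colour(content_stream: str) -> bool:
    """Return True if the stream sets a non-grey RGB or CMYK colour."""
    tokens = content_stream.split()
    for index, token in enumerate(tokens):
        arity = COLOUR_OPERATORS.get(token)
        if arity is None or index < arity:
            continue
        try:
            operands = [float(t) for t in tokens[index - arity:index]]
        except ValueError:
            continue  # operands were not plain numbers; skip this operator
        if arity == 3:
            # RGB is grey only when all three components are equal.
            if len(set(operands)) > 1:
                return True
        else:
            # Simplification: CMYK counts as grey only when cyan, magenta
            # and yellow are all zero (pure black channel).
            if any(component != 0 for component in operands[:3]):
                return True
    return False


print(stream_uses_colour("0 0 0 rg 10 10 100 50 re f"))  # False
print(stream_uses_colour("1 0 0 rg 10 10 100 50 re f"))  # True
```

Because no pixels are rendered, the cost is roughly that of reading and tokenising the page, but a page whose only colour comes from an embedded image would be missed by this sketch.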