How to trim (crop) bottom whitespace of a PDF document, in memory-CodePudding

I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it immediately with the correct height (which I've failed to do so far) or render it incorrectly and trim it. I'm using Python.

Attempt type 1:

wkhtmltopdf render to a very, very long single-page PDF with a lot of extra space using --page-height
Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])

The PDF is rendered perfectly with 28 units of margin at the bottom, but I had to use the filesystem to execute the crop command. It seems that the tool expects an input file and output file, and also creates temporary files midway through. So I can't use it.

Attempt type 2:

wkhtmltopdf render to multi-page PDF with default parameters
Use PyPDF4 (or PyPDF2) to read the file and combine pages into a long, single page

The PDF is rendered fine-ish in most cases, however, sometimes a lot of extra white space can be seen on the bottom if by chance the last PDF page had very little content.

Ideal scenario:

The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy with rendering the PDF using wkhtmltopdf, since it returns bytes, and later processing these bytes to remove any extra white space. But I don't want to involve the file system in this, as instead, I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the render height before-hand?

What am I doing now:

Note that pdfkit is a wkhtmltopdf wrapper

# This is not a valid HTML (includes Django-specific stuff)
template: Template = get_template("some-django-template.html")

# This is now valid HTML
rendered = template.render({
    "foo": "bar",
})

# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
return pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

It's equivalent to Attempt type 2, except I don't use PyDPF4 here to stitch the pages together, but instead render again with wkhtmltopdf using precomputed page height.

CodePudding user response：

There might be better ways to do this, but 3 days into a bounty with no answers, this at least works. I'm assuming that you are able to crop the PDF yourself, and all I'm doing here is determining how far down on the last page you still have content. If that assumption is wrong, I could probably figure out how to crop the PDF. Or otherwise, just crop the image (easy in Pillow) and then convert that to PDF? Also, if you have one big PDF, you might need to figure how how far down on the whole PDF the text ends. I'm just finding out how far down on the last page the content ends. But converting from one to the other is like just an easy arithmetic problem.

Tested code:

import pdfkit
from PyPDF2 import PdfFileReader
from io import BytesIO

# This library isn't named fitz on pypi,
# obtain this library with `pip install PyMuPDF==1.19.4`
import fitz

# `pip install Pillow==8.3.1`
from PIL import Image

import numpy as np

# However you arrive at valid HTML, it makes no difference to the solution.
rendered = "<html><head></head><body><h3>Hello World</h3><p>hello</p></body></html>"

# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
pdf_bytes = pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})

# convert the pdf into an image.
pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
last_page = pdf[pdf.pageCount-1]
matrix = fitz.Matrix(1, 1)
image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")

image = Image.frombytes("L", [image_pixels.width, image_pixels.height], image_pixels.samples)

#Uncomment if you want to see.
#image.show()

# Now figure out where the end of the text is:

# First binarize. This might not be the most efficient way to do this.
# But it's how I do it.
THRESHOLD = 100
# I wrote this code ages ago and don't remember the details but
# basically, we treat every pixel > 100 as a white pixel, 
# We convert the result to a true/false matrix 
# And then invert that. 
# The upshot is that, at the end, a value of "True" 
# in the matrix will represent a black pixel in that location.
binary_matrix = np.logical_not(image.point( lambda p: 255 if p > THRESHOLD else 0 ).convert("1"))

# Now find last white row, starting at the bottom
row_count, column_count = binary_matrix.shape

last_row = 0
for i, row in enumerate(reversed(binary_matrix)):
    if any(row):
        last_row = i
        break
    else:
        continue 

percentage_from_top = (1 - last_row / row_count) * 100
print(percentage_from_top)

# Now you know where the page ends.
# Go back and crop the PDF accordingly.