Saving a redacted PDF file in Python to mask the text underneath

I read in a PDF file in Python, added a text box on top of the text that I'd like to redact, and saved the change in a new PDF file. When I searched for the text in the redacted PDF file using a PDF reader, it could still be found.

Is there a way to save the PDF as a single-layer file? Or is there a way to ensure that the text under the text box is actually removed?

import io

from PyPDF2 import PdfReader, PdfWriter
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

files = "input.pdf"  # placeholder path to the source PDF

# Draw a filled rectangle on an otherwise empty overlay page.
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=A4)
can.setFillColorRGB(1, 1, 1)  # the fill color must be set before drawing
can.rect(65, 750, 40, 30, stroke=1, fill=1)
can.save()

packet.seek(0)
new_pdf = PdfReader(packet)
reader = PdfReader(files)
output = PdfWriter()

# Stamp the overlay onto the second page (index 1). The original text stays
# in the page's content stream; it is only covered visually.
page_to_output = reader.pages[1]
page_to_output.merge_page(new_pdf.pages[0])
output.add_page(page_to_output)

with open("NewFile.pdf", "wb") as output_stream:
    output.write(output_stream)

CodePudding user response：

I used one of the solutions (pdf2image and PIL) from the link provided by @Matt Pitken, and it worked well.
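The linked solution is not reproduced in this thread, so what follows is only a minimal sketch of what a pdf2image and PIL approach could look like: render each page to an image, paint over the region, and save the images as a new PDF. The function names, the file paths, and the `rects_by_page` parameter are made up for illustration; the rectangle reuses the coordinates from the question. It assumes Poppler is installed (pdf2image needs it) and that the page is unrotated with its origin at the bottom-left corner.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height) in PDF points


def pdf_rect_to_pixels(rect: Rect, image_height_px: int, dpi: int) -> Tuple[int, int, int, int]:
    """Convert a bottom-left-origin PDF rectangle to a top-left-origin pixel box."""
    x, y, width, height = rect
    scale = dpi / 72.0  # PDF user space uses 72 points per inch
    left = round(x * scale)
    right = round((x + width) * scale)
    top = round(image_height_px - (y + height) * scale)
    bottom = round(image_height_px - y * scale)
    return left, top, right, bottom


def redact_by_rasterizing(
    src_path: str,
    dst_path: str,
    rects_by_page: Dict[int, List[Rect]],
    dpi: int = 200,
) -> None:
    """Render pages to images, black out the given rectangles, save as PDF."""
    # Third-party imports are kept local so the helper above works without them.
    from pdf2image import convert_from_path  # requires Poppler
    from PIL import ImageDraw

    images = convert_from_path(src_path, dpi=dpi)  # one PIL image per page
    for index, image in enumerate(images):
        draw = ImageDraw.Draw(image)
        for rect in rects_by_page.get(index, []):
            draw.rectangle(pdf_rect_to_pixels(rect, image.height, dpi), fill="black")
    # Pages saved from images contain pixels only, so no text layer remains.
    images[0].save(dst_path, "PDF", resolution=dpi, save_all=True, append_images=images[1:])


# Example call (hypothetical paths), covering the question's rectangle on page index 1:
# redact_by_rasterizing("input.pdf", "NewFile.pdf", {1: [(65, 750, 40, 30)]})
```

The trade-off is that the whole document becomes image-only: nothing on any page can be searched or selected afterwards, and the file is usually larger.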

CodePudding user response：

Disclaimer: I am the author of borb, the library used in this answer

Redaction in PDF is done through annotations. You can think of annotations as "something I added later to the PDF", for instance a post-it note with a remark.

Redaction annotations are basically a post-it with the implied meaning "this content needs to be removed from the PDF".

In borb, you can add redaction annotations and then apply them. This is purposefully a two-step process. The idea is that you can send the document (with annotations) to someone else and ask them to review it (e.g. "Did I remove all the content that needed to be removed?").

Once your document is ready, you can apply the redaction annotations, which will effectively remove the content.

Step 1 (creating a PDF with content, and redaction annotations):

from decimal import Decimal

from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf import SingleColumnLayout
from borb.pdf import PageLayout
from borb.pdf import Paragraph
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import PDF


def main():

    doc: Document = Document()

    page: Page = Page()
    doc.add_page(page)

    layout: PageLayout = SingleColumnLayout(page)

    layout.add(
        Paragraph(
            """
                        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
                        Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
                        Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
                        Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                        """
        )
    )

    page.add_annotation(
        RedactAnnotation(
            Rectangle(Decimal(405), Decimal(721), Decimal(40), Decimal(8)).grow(
                Decimal(2)
            )
        )
    )

    # store
    with open("output.pdf", "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)


if __name__ == "__main__":
    main()

Of course, you can simply open an existing PDF and add a redaction annotation.

Step 2 (applying the redaction annotation):

import typing
from borb.pdf import Document
from borb.pdf import PDF


def main():

    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # apply redaction annotations
    doc.get_page(0).apply_redact_annotations()

    # store
    with open("output.pdf", "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)


if __name__ == "__main__":
    main()
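
One way to check that the text was removed rather than just covered, as happened in the question, is to extract the text layer of the result and search it. This is a minimal sketch, assuming PyPDF2 2.x or later (`PdfReader`, `reader.pages`, `page.extract_text()`); the helper names and the search term are hypothetical.

```python
from typing import Iterable, List


def pages_containing(page_texts: Iterable[str], term: str) -> List[int]:
    """Return 0-based indices of pages whose text contains term (case-insensitive)."""
    needle = term.casefold()
    return [i for i, text in enumerate(page_texts) if needle in text.casefold()]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract the text layer of every page with PyPDF2."""
    # Imported locally so pages_containing can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Image-only pages may yield an empty string or None; normalize to "".
    return [page.extract_text() or "" for page in reader.pages]


# Example call (hypothetical term), listing pages where the text is still present:
# print(pages_containing(extract_page_texts("output.pdf"), "secret"))
```

An empty result is a good sign but not proof: text extraction can split or reorder words, so it is worth searching for short fragments of the redacted text as well.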