Split big PDF into multiple smaller PDFs of different page length based on specific string appearance

Time:01-15

Problem

I have a long PDF file with many pages. I want to split this PDF into many smaller files whose lengths are derived from the text content of the long PDF. You can think of the string as something that triggers a pair of scissors, which cuts the long PDF and also gives the smaller PDF its filename.

The "scissors" strings are generated by the following loop, where they are stored in "text":

for municipality in array_merged_zone_send:
    text = f'PANORAMICA DI {municipality.upper()}'

If I call print(text) inside the loop, the result is:

PANORAMICA DI BELLINZONA
PANORAMICA DI RIVIERA
PANORAMICA DI BLENIO
PANORAMICA DI ACQUAROSSA

The strings above are unique values; each appears only once. I have shown only the first four; there are more, and EVERY item is written in the original PDF that I want to split. Every item appears exactly once in the original PDF (the match is always one to one), and the PDF never contains an additional "PANORAMICA DI ..." that is not already an item produced by the loop. PANORAMICA means OVERVIEW in English.

Here is an example of the pages in the original PDF that contain the string from the item "PANORAMICA DI BLENIO": [image]

What I want to do: I want to split the original PDF every time one of the string items appears. In the image above, the original PDF has to be split in two: the first PDF ends on the page before "PANORAMICA DI BLENIO"; the second begins on the page with "PANORAMICA DI BLENIO" and ends on the page before the next "PANORAMICA DI {municipality.upper()}". The resulting PDF name is "zp_Blenio.pdf" for the second and "zp_Acquarossa.pdf" for the first. This should not be a problem, because "municipality" without upper() is already in the right form (in other words, "Acquarossa" and "Blenio").

Another example, as a simplified simulation (my file has more pages): the original PDF is 12 pages long. Note that the following is not code; I formatted it as code only for readability:

page 1: "PANORAMICA DI RIVIERA"
page 2: no match with "text" item
page 3: no match with "text" item 
page 4: "PANORAMICA DI ACQUAROSSA"
page 5: no match with "text" item 
page 6: "PANORAMICA DI BLENIO"
page 7: no match with "text" item 
page 8: no match with "text" item
page 9: no match with "text" item 
page 10: no match with "text" item
page 11: "PANORAMICA DI BELLINZONA"
page 12: no match with "text" item

The results will be (again, this is not code; it is formatted as code only for readability):

first created pdf is from page 1 to page 3
second created pdf is from page 4 to page 5
third pdf is from page 6 to 10
fourth pdf is from page 11 to 12

The rule is: each new PDF starts on a page where a "text" string appears and ends on the page before the next page where one appears, and so on until the end of the document.
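
The rule can be sketched without any PDF library. The snippet below is only an illustration, assuming the 1-based page numbers of the simulated 12-page example; `split_ranges` is a hypothetical helper name, not part of PyPDF2.

```python
def split_ranges(break_pages, total_pages):
    """Return (first_page, last_page) pairs, 1-based and inclusive.

    break_pages: sorted page numbers on which a marker string appears.
    total_pages: number of pages in the whole document.
    """
    # Each chunk ends on the page before the next marker; the last chunk
    # ends on the last page of the document.
    ends = [p - 1 for p in break_pages[1:]] + [total_pages]
    return list(zip(break_pages, ends))


# Marker pages taken from the simulated 12-page example
ranges = split_ranges([1, 4, 6, 11], 12)
print(ranges)  # [(1, 3), (4, 5), (6, 10), (11, 12)]
```

Because the ranges are computed from the marker positions, they adapt automatically when the number of pages between two markers changes.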

Take care: my original PDF is generated as part of a long Python script and changes on every run, but the "PANORAMICA DI ..." rule does not change. In other words, the number of pages between "PANORAMICA DI ACQUAROSSA" and "PANORAMICA DI BLENIO" may change. This rules out the workaround of setting the page intervals manually and ignoring the rule described above.

Attempt to solve the problem

The only solution to this issue that I have found is code that is obsolete and was not tested by its author; it can be found on this page: https://stackoverflow.com/a/62344714/13769033

I took the code, updated it for the new function and class names, and integrated the loop that produces "text".

The old code after my update is the following:

from PyPDF2 import PdfWriter, PdfReader
import re

def getPagebreakList(file_name: str)->list:
    pdf_file = PyPDF2.PdfReader(file_name)
    num_pages = len(pdf_file.pages)
    page_breaks = list()
    for i in range(0, num_pages):
        Page = pdf_file.pages[i] 
        Text = PageObject.extract_text() 
        for municipality in array_merged_zone_send:
            text = f'PANORAMICA DI {municipality.upper()}'
            if re.search(text, Text):
                page_breaks.append(i)
    return page_breaks


inputpdf = PdfReader(open("./report1.pdf", "rb"))
num_pages = len(inputpdf.pages)
page_breaks = getPagebreakList("./report1.pdf")

i = 0
while (i < num_pages):
    if page_breaks:
        page_break = page_breaks.pop(0)
    else:
        page_break = num_pages
    output = PdfWriter()
    while (i != page_break + 1):
        output.add_page(inputpdf.pages[i])
        i = i + 1
    with open(Path('.')/f'zp_{municipality}.pdf',"wb") as outputStream:
        output.write(outputStream)

Unfortunately, I don't understand a large part of the code.

Among the parts that I don't understand at all, and where I don't know whether the author made an error:

  • the indentation of "output = PdfWriter()"
  • the "getPagebreakList('./report1.pdf')" call, where I put the same PDF that I want to split but where the author put "getPagebreakList('yourPDF.pdf')", which was nevertheless different from PdfFileReader(open("80....pdf", "rb")). I assume the author should have written yourPDF.pdf for both.

Note: "./report1.pdf" is the path of the PDF to split, and I am sure that it is correct.

The code is wrong: when I execute it, I get "TypeError: 'list' object is not callable".

I would like someone to help me find the solution. You can modify my updated code or suggest another way to solve the problem. Thank you.

Suggestion to simulate

To simplify, I suggest starting with a static string from your PDF (a string that repeats every few pages) instead of items from an array.

In my case, I used:

Text = PageObject.extract_text() 
text = 'PANORAMICA'
if re.search(text, Text):
    page_breaks.append(i)

...and I also changed the output path.

You can simply use a long PDF with a fixed text that repeats at irregular intervals (once after 3 pages, once after 5 pages and so on).

Only once you have found the solution do you need to integrate the loop over municipality. Including "municipality" in the text only serves to put the "municipality" into the names of the new PDF files. Using only "PANORAMICA" does not affect the page intervals of the new PDFs.
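
A simulation of this kind does not even need a real PDF. The sketch below stands in a list of strings for the extracted page texts; `find_break_indices` and the page contents are hypothetical names and data chosen for illustration.

```python
import re


def find_break_indices(page_texts, marker="PANORAMICA"):
    """Return the 0-based indices of the pages whose text contains marker."""
    pattern = re.compile(re.escape(marker))
    return [i for i, text in enumerate(page_texts) if pattern.search(text)]


# Hypothetical page texts: the marker appears at irregular intervals
pages = [
    "PANORAMICA DI RIVIERA", "table", "chart",
    "PANORAMICA DI ACQUAROSSA", "table",
    "PANORAMICA DI BLENIO", "table", "chart", "map", "notes",
    "PANORAMICA DI BELLINZONA", "table",
]

breaks = find_break_indices(pages)
print(breaks)  # [0, 3, 5, 10]
```

With a real file, the list of strings would come from calling `extract_text()` on each page, so the indices are 0-based like the page indices of the reader.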

CodePudding user response:

My suggestion is to divide the problem into smaller ones, essentially a divide-and-conquer approach. With single-task functions, debugging should be easier when something goes wrong. Notice that the replacement for getPagebreakList is slightly different: it returns a dictionary instead of a list.

import re

from PyPDF2 import PdfWriter, PdfReader

# array_merged_zone_send: the list of municipality names from the question


def page_breaks(pdf_r: PdfReader) -> dict:
    # map each municipality to the 0-based index of the page with its marker
    breaks = {}
    for i in range(len(pdf_r.pages)):
        pdf_text = pdf_r.pages[i].extract_text()
        for municipality in array_merged_zone_send:
            pattern = f'PANORAMICA DI {municipality.upper()}'
            if re.search(pattern, pdf_text):
                breaks[municipality] = i
    return breaks


def filenames_range_mapper(pdf_r: PdfReader, page_indices: dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1] # add last page as well
    # slice the pages from the reader object (a slice end past the last page is harmless)
    return {name: pdf_r.pages[start:end] for name, start, end in zip(page_indices, num_pages, num_pages[1:])}


def save(file_name: str, pdf_pages: list) -> None:
    # pass the pages to the writer
    pdf_w = PdfWriter()
    for p in pdf_pages:
        pdf_w.add_page(p)

    # write to file
    with open(file_name, "wb") as outputStream:
        pdf_w.write(outputStream)

    # message
    print(f'Pdf "{file_name}" created.')


# main
# ####
# initial location of the file
file_name = "./report1.pdf"
# create reader object
pdf_r = PdfReader(open(file_name, "rb"))
# get index locations of matches (a new name, so the function is not overwritten)
breaks = page_breaks(pdf_r)
# dictionary mapping each name to its slice of pages
mapper = filenames_range_mapper(pdf_r, breaks)

# template file name
template_output = './zp_{}.pdf'
# iterate over the location-pages mapper
for municipality, pages in mapper.items():
    # set file name
    new_file_name = template_output.format(municipality.title()) # eventually municipality.upper()
    # save the pages into a new file
    save(new_file_name, pages)

Test the code with an auxiliary function to avoid unwanted output.

In this case it is enough to use a slightly different implementation of filenames_range_mapper in which the values are just lists of integers (and not page objects).

def filenames_range_mapper_tester(pdf_r: PdfReader, page_indices: dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1] # add last page as well
    # slice the list of page indices instead of the pages themselves
    return {name: list(range(len(pdf_r.pages)))[start:end] for name, start, end in zip(page_indices, num_pages, num_pages[1:])}

# auxiliary test
file_name = "./report1.pdf"
pdf_r = PdfReader(open(file_name, "rb"))
breaks = page_breaks(pdf_r)  # a new name, so the function is not overwritten
mapper = filenames_range_mapper_tester(pdf_r, breaks)

template_output = './zp_{}.pdf'
for name, pages in mapper.items():
    print(template_output.format(name.title()), pages)

If the output makes sense, you can proceed with the non-testing code.


An abstraction on how to get the right pages:

# mimic return of "page_breaks"
page_breaks = {
    "RIVIERA": 1,
    "ACQUAROSSA": 4,
    "BLENIO": 6,
    "BELLINZONA": 11
}

# mimic "filenames_range_mapper"
last_page_of_pdf = 12 + 1 # increment by 1 the number of pages of the pdf!

num_pages = list(page_breaks.values()) + [last_page_of_pdf]
#[1, 4, 6, 11, 13]

mapper = {name: list(range(start, end)) for name, start, end in zip(page_breaks, num_pages, num_pages[1:])}
#{'RIVIERA': [1, 2, 3],
# 'ACQUAROSSA': [4, 5],
# 'BLENIO': [6, 7, 8, 9, 10],
# 'BELLINZONA': [11, 12]}
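
The abstraction counts pages from 1, whereas `pdf_r.pages[i]` is indexed from 0. A minimal sketch of the same mapping with 0-based indices follows; a plain list stands in for `pdf_r.pages`, so the data here is illustrative only.

```python
# A plain list stands in for pdf_r.pages (12 pages, indices 0..11)
pages = [f"page {n}" for n in range(1, 13)]

# 0-based indices of the marker pages, in page order
page_breaks = {"RIVIERA": 0, "ACQUAROSSA": 3, "BLENIO": 5, "BELLINZONA": 10}

# Slice ends are exclusive, so the page count itself works as the final bound
bounds = list(page_breaks.values()) + [len(pages)]

mapper = {
    name: pages[start:end]
    for name, start, end in zip(page_breaks, bounds, bounds[1:])
}

for name, chunk in mapper.items():
    print(name, chunk[0], "to", chunk[-1])
```

Since a slice end beyond the length of the sequence is clamped, adding 1 to the page count gives the same chunks; the page count alone is already sufficient.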

CodePudding user response:

@cards

I have run the test that you suggested in your comments.

I am posting the code to be sure that I tested it in the right way. The code that I ran is exactly this:

from PyPDF2 import PdfWriter, PdfReader


def page_breaks(pdf_r:PdfReader) -> dict:
    page_breaks = {}
    for i in range(len(pdf_r.pages)):
        pdf_text = pdf_r.pages[i].extract_text() 
        for municipality in array_merged_zone_send:
            pattern = f'PANORAMICA DI {municipality.upper()}'
            if re.search(pattern, pdf_text):
                page_breaks[municipality] = i
    return page_breaks

def filenames_range_mapper_tester(pdf_r:PdfReader, page_indices:dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1] # add last page as well
    # slice the pages from the reader object
    return {name: list(range(len(pdf_r.pages)))[start,end] for name, start, end in zip(page_indices, num_pages, num_pages[1:])}

# auxiliary test
file_name = "./report1.pdf"
pdf_r = PdfReader(open(file_name, "rb"))
page_breaks = page_breaks(pdf_r)
mapper = filenames_range_mapper_tester(pdf_r, page_breaks)

template_output = './zp_{}.pdf'
for name, pages in mapper.items():
   print(template_output.format(name.title()), pages)

The test raised an error at the line "num_pages = ...".

File "<string>", line 15, in filenames_range_mapper_tester
TypeError: 'list' object is not callable
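
This message usually means that, at the point of the call, the name `list` no longer refers to the built-in type, for example because an earlier line in the same script assigned a list to a variable named `list`. Whether that is what happened here cannot be told from the posted code alone; the sketch below only reproduces the symptom under that assumption, with hypothetical function names.

```python
def values_with_shadowing(mapping):
    list = ["an", "earlier", "assignment"]  # hypothetical: hides the built-in
    return list(mapping.values())  # calls a list instance, so it fails


def values_without_shadowing(mapping):
    names = ["an", "earlier", "assignment"]  # different name, built-in intact
    return list(mapping.values())


data = {"RIVIERA": 0, "ACQUAROSSA": 3}

try:
    values_with_shadowing(data)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)  # 'list' object is not callable
print(values_without_shadowing(data))  # [0, 3]
```

If this is the cause, searching the surrounding script for an assignment such as `list = ...` and renaming that variable would be the first thing to try.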

An observation: I still don't understand how, in the code above, you can write:

def filenames_range_mapper_tester(pdf_r:PdfReader, page_indices:dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1]

I have looked everywhere in your code, but I cannot find where "page_indices" is defined. Where do you define "page_indices" before calling its "values()" method?

In your answer you confirm that, in the abstraction, page_indices = page_breaks. Why, then, don't you write the code this way:

def filenames_range_mapper_tester(pdf_r:PdfReader, page_breaks:dict) -> dict:
    num_pages = list(page_breaks.values()) + [len(pdf_r.pages) + 1] # add last page as well
    # slice the pages from the reader object
    return {name: list(range(len(pdf_r.pages)))[start,end] for name, start, end in zip(page_breaks, num_pages, num_pages[1:])}

If you use page_breaks, then "page_breaks" is already defined as the result of the function page_breaks, so the method is applied to something defined earlier. However, even if I replace page_indices with page_breaks, the error is the same.
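
On the question of where "page_indices" is defined: it is a function parameter, so it needs no earlier definition; it receives its value when the function is called. A small sketch of that mechanism, with hypothetical names:

```python
def count_breaks(page_indices: dict) -> int:
    # `page_indices` is a local name bound to whatever the caller passes in
    return len(page_indices)


breaks = {"RIVIERA": 0, "ACQUAROSSA": 3}

# The caller's variable is called `breaks`; inside the function the same
# dictionary is reachable under the name `page_indices`.
print(count_breaks(breaks))  # 2
```

Renaming the parameter therefore does not change what the function does, which is consistent with the error staying the same after the rename.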
