Python - Merge PDF files with same prefix using PyPDF2

Time:11-04

I have multiple PDF files whose names are made up of underscore-separated fields. I want to merge these PDF files based on the third field (the third underscore-separated value in the file name), using the Python library PyPDF2.

For example:

0_2021_1_123.pdf
0_2021_1_1234.pdf
0_2021_1_12345.pdf
0_2021_2_123.pdf
0_2021_2_1234.pdf
0_2021_2_12345.pdf

Expected outcome

1_merged.pdf
2_merged.pdf

Here is what I tried, but I am getting an error and it does not work. Any help is much appreciated.

from PyPDF2 import PdfFileMerger
import io
import os
files = os.listdir("C:\\test\\raw")
x=0

merger = PdfFileMerger()
for filename in files:
    print(filename.split('_')[2])
    prefix = filename.split('_')[2]
    if filename.split('_')[2] == prefix:
        merger.append(filename)
    merger.write("C:\\test\\result" + prefix + "_merged.pdf")
    merger.close()

This is the error message:

Traceback (most recent call last):
  File "C:/test2.py", line 12, in <module>
    merger.append(filename)
  File "C:\py\lib\site-packages\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\py\lib\site-packages\PyPDF2\merger.py", line 114, in merge
    fileobj = file(fileobj, 'rb')

FileNotFoundError: [Errno 2] No such file or directory: '0_2021_564495_12345.pdf'

Process finished with exit code 1

Answer:

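os.listdir() returns bare file names only; it does not include the directory. merger.append(filename) therefore looks for the file relative to the current working directory, which is why you get FileNotFoundError.

To give the merger a path it can actually open, join the directory back onto each name with os.path.join().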
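
As a minimal sketch of that point, the snippet below lists a directory and rebuilds the full paths. It uses a temporary directory and a single empty placeholder file (both assumptions made for illustration) rather than the real C:\test\raw folder:

```python
import os
import tempfile

# Throwaway directory so the example does not depend on C:\test\raw.
with tempfile.TemporaryDirectory() as root_path:
    # Empty placeholder file; only its name matters here.
    open(os.path.join(root_path, "0_2021_1_123.pdf"), "wb").close()

    # os.listdir() returns names only, without the directory part.
    names = os.listdir(root_path)

    # Join the directory back on to get paths that can be opened.
    full_paths = [os.path.join(root_path, name) for name in names]

    # A bare name is resolved against the current working directory,
    # so only the joined path is guaranteed to point at the file.
    full_path_is_found = os.path.exists(full_paths[0])
```

Passing an entry of full_paths to merger.append() avoids the lookup in the current working directory.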

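However, the names returned by os.listdir() are not guaranteed to come back in any particular order, so files with the same prefix may not be adjacent. The loop in the question also compares each file's prefix with itself, so the check is always true, and it writes and closes the merger on every iteration. It is better to refactor so that you first group the files by prefix, then process each prefix group: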

from collections import defaultdict

from PyPDF2 import PdfFileMerger
import os

root_path = "C:\\test\\raw"
result_path = "C:\\test\\result"

files_by_prefix = defaultdict(list)
for filename in os.listdir(root_path):
    prefix = filename.split("_")[2]
    files_by_prefix[prefix].append(filename)

for prefix, filenames in files_by_prefix.items():
    result_name = os.path.join(result_path, prefix + "_merged.pdf")
    print(f"Merging {filenames} to {result_name} (prefix {prefix})")
    # Use a fresh merger per prefix so groups do not leak into each other.
    merger = PdfFileMerger()
    # Sort for a deterministic page order; os.listdir() order is arbitrary.
    for filename in sorted(filenames):
        merger.append(os.path.join(root_path, filename))
    merger.write(result_name)
    merger.close()
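
The grouping step above assumes every entry in the folder is a PDF with at least three underscore-separated fields. If the folder might contain other files, a slightly more defensive grouping could look like the sketch below. The helper name group_pdfs_by_field and its field_index parameter are hypothetical, introduced here only for illustration, and the sample names are made up:

```python
import os
from collections import defaultdict


def group_pdfs_by_field(filenames, field_index=2):
    """Group PDF file names by one underscore-separated field (hypothetical helper)."""
    groups = defaultdict(list)
    for filename in filenames:
        stem, extension = os.path.splitext(filename)
        if extension.lower() != ".pdf":
            continue  # ignore entries that are not PDF files
        fields = stem.split("_")
        if len(fields) <= field_index:
            continue  # name does not follow the expected pattern
        groups[fields[field_index]].append(filename)
    # Sort each group so the merge order is deterministic.
    return {key: sorted(names) for key, names in groups.items()}


# Made-up sample names, including two that should be skipped.
names = [
    "0_2021_1_123.pdf",
    "0_2021_2_1234.pdf",
    "0_2021_1_1234.pdf",
    "notes.txt",
    "scan.pdf",
]
groups = group_pdfs_by_field(names)
```

Each value in groups can then be fed to the merge loop in place of files_by_prefix. Note also that PdfFileMerger is the name used by PyPDF2 releases before 3.0; in PyPDF2 3.x the class is called PdfMerger.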