split combined pdf into original documents-CodePudding

Is there a way of identifying individual documents in a combined pdf and split it accordingly?

The pdf I am working on contains combined scans (with OCR, mostly) of individual documents. I would like to split it back into the original documents.

These original documents are of unstandardised length and size (hence, adobe's split by "Number of pages" or "File Size" are not an option). The "Top level bookmarks" seem to correspond to something different than individual documents, so splitting on them does not provide a useful result either.

I've created an xml version of the file. I'm not too familiar with it but having looked at it, I couldn't identify a standardised tag or something similar that indicates the start of a new document.

The answer to this question requires control over the merging process (which I don't have), while the answer to this question does not work because I have no standardised keyword on which to split.

Eventually, I would like to do this split for a few hundred pdfs. An example of a pdf to be split can be found here.

CodePudding user response：

As per discussions in comments one course of action is to parse the pages information (MediaBox) via python. However I prefer a few fast cmd line commands rather than write and test a heavier solution on this lightweight netbook.

Thus I would build a script to handle a loop of files and pass to windows console the files using Xpdf command line tools

Edit Actually most Python libs tend to include the poppler version (2022-01) of pdfinfo so you should be able to call or request feedback from that variant via your libs.

Using PDFinfo on your file and limit it to first 20 pages for a quick test is

pdfinfo -f 1 -l 20 yourfile.pdf and the response will be a text output suitable for comparison:-

Title:          Microsoft Word - 20190702_Revision_CO2_Verordnung_Detailkommenta
re_SWISS_final
Subject:
Keywords:
Author:         heim
Creator:        PDF24 Creator
Producer:       GPL Ghostscript 9.25
CreationDate:   Thu Jul 18 17:36:26 2019
ModDate:        Thu Jul 18 17:36:26 2019
Tagged:         no
Form:           none
Pages:          223
Encrypted:      no
Page    1 size: 595 x 842 pts (A4) (rotated 0 degrees)
Page    2 size: 595 x 842 pts (A4) (rotated 0 degrees)
Page    3 size: 595.32 x 841.92 pts (A4) (rotated 0 degrees)
Page    4 size: 595.44 x 842.04 pts (A4) (rotated 0 degrees)
Page    5 size: 595.44 x 842.04 pts (A4) (rotated 0 degrees)
Page    6 size: 595.2 x 841.9 pts (A4) (rotated 0 degrees)
Page    7 size: 595.45 x 841.9 pts (A4) (rotated 0 degrees)
Page    8 size: 595.45 x 841.9 pts (A4) (rotated 0 degrees)
Page    9 size: 595.2 x 841.44 pts (rotated 0 degrees)
Page   10 size: 595.2 x 841.44 pts (rotated 0 degrees)
Page   11 size: 595.2 x 841.68 pts (rotated 0 degrees)
Page   12 size: 594.54 x 840.78 pts (rotated 0 degrees)
Page   13 size: 591.85 x 835.45 pts (rotated 0 degrees)
Page   14 size: 593.75 x 835.45 pts (rotated 0 degrees)
Page   15 size: 595.2 x 841.44 pts (rotated 0 degrees)
Page   16 size: 595.32 x 841.92 pts (A4) (rotated 0 degrees)
Page   17 size: 593.5 x 840.7 pts (rotated 0 degrees)
Page   18 size: 594.72 x 840.96 pts (rotated 0 degrees)
Page   19 size: 596 x 842 pts (A4) (rotated 0 degrees)
Page   20 size: 595.2 x 841.68 pts (rotated 0 degrees)
File size:      33926636 bytes
Optimized:      no
PDF version:    1.4

In a command line I would possibly only use the desired page ### and size: values (discarding the verbage) to make line by line match analysis easier.

We can see that in this case, as suspected by @mkl, there is some commonality in sequential pages.

The above is less than a 10% sample and may not represent the full picture but its promising enough to pair either X or Y values in sequential pages. I ran 200 pages (on this slow machine in seconds) and the output slowly flashing by had sufficient similarities to suggest this is a viable part answer to build upon.

The majority of pairs are matched in first value but the oddity was 13 & 14 which matched in the second value, HOWEVER note number 6 matches second value with 7 & 8 but is not the same document, thus cross checking such cases may be desired.