Scaling of pdf file size

Time:11-23

I have a question on how pdf file size scales.

I noticed that when I extract a single page from a given PDF, the resulting file is almost always about half the size of the original. (See attached.) My questions are:

  1. Why is this the case? It seems to suggest that the non-text information (e.g. styling) takes up more than half of the PDF file size.
  2. Are there any tricks for "compressing" a PDF so that the file is smaller?

Figures: In both figures, the PDFs with the long names (2211.11725.pdf and 2211.11712.pdf) are the original documents, which were produced by Print -> Save as PDF on macOS Monterey 12.4.

Original documents:

  • [Two images, not preserved.]

    CodePudding user response:

    To take your smaller file, let's pick just page one of the twelve:

    $ cpdf 2211.11725.pdf 1 -o out.pdf
    

    It's 84 KB, i.e. almost half of the original 12-page file's 192 KB. Let's get rid of any inefficiency in the object graph:

    $ cpdf -squeeze out.pdf -o out2.pdf
    Initial file size is 84392 bytes
    Beginning squeeze: 156 objects
    Squeezing... Down to 146 objects
    Squeezing... Down to 136 objects
    Squeezing page data and xobjects
    Recompressing document
    Final file size is 67147 bytes, 79.57% of original.
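
The percentage in the log can be checked by hand, and the squeezed page can be compared with the original file. This is only a small arithmetic sketch: the two byte counts come from the log above, while the original file is quoted only as "192kb", so treating it as 192,000 bytes is an assumption.

```python
# Sizes in bytes, taken from the cpdf -squeeze log above.
before_squeeze = 84392
after_squeeze = 67147

# ASSUMPTION: the original is quoted only as "192kb"; treat it as 192,000 bytes.
original_approx = 192_000

# Ratio reported by cpdf ("% of original" refers to the unsqueezed single page).
squeeze_ratio = after_squeeze / before_squeeze

# Size of the squeezed single page relative to the full 12-page document.
share_of_original = after_squeeze / original_approx

print(f"squeezed page / unsqueezed page: {squeeze_ratio:.2%}")
print(f"squeezed page / 12-page original: {share_of_original:.1%}")
```

Even after squeezing, one page out of twelve still takes far more than a twelfth of the original size, which is the non-linearity the question asks about.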
    

    OK, now 67 KB (67,147 bytes), about a third of the original. Now let's decompress it so we can look at it:

    $ cpdf -decompress -no-preserve-objstm out2.pdf -o decomp.pdf
    

    You can now open the file up in a text editor, and scroll around. As you can see, there is one big item other than the actual page content: the embedded fonts. These are shared between all the pages. In addition, the fonts in this new page still contain characters used on other pages (but not this one). To reduce the font size further, you would have to re-subset them according to the characters now in use. Cpdf can't do this, though you might find software which can.
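
Instead of scrolling through the decompressed file by eye, a rough text-level scan can list the embedded fonts. The sketch below is not a PDF parser: it assumes the object dictionaries are stored uncompressed (as in the decompressed file), that embedded font programs are referenced through `/FontFile`, `/FontFile2` or `/FontFile3` keys, and that subset fonts use the usual six-letter tag (for example `ABCDEF+CMR10`). The function name `summarize_fonts` and the sample bytes are made up for illustration.

```python
import re

# Keys in a font descriptor that point at an embedded font program.
FONT_FILE_KEY = re.compile(rb"/FontFile[23]?(?![A-Za-z0-9])")

# Subset fonts conventionally carry a six-uppercase-letter tag and a plus sign.
SUBSET_NAME = re.compile(rb"/FontName\s*/([A-Z]{6}\+[^\s/<>\[\]()]+)")


def summarize_fonts(pdf_bytes: bytes) -> dict:
    """Rough summary of embedded fonts found by scanning raw PDF bytes.

    Only meaningful when the dictionaries are not hidden inside
    compressed object streams.
    """
    names = {m.group(1).decode("latin-1") for m in SUBSET_NAME.finditer(pdf_bytes)}
    return {
        "font_file_refs": len(FONT_FILE_KEY.findall(pdf_bytes)),
        "subset_font_names": sorted(names),
    }


# Hypothetical fragment standing in for two font descriptors.
sample = (
    b"<< /Type /FontDescriptor /FontName /ABCDEF+CMR10 /FontFile 12 0 R >>\n"
    b"<< /Type /FontDescriptor /FontName /GHIJKL+CMMI10 /FontFile3 15 0 R >>\n"
)
summary = summarize_fonts(sample)
print(summary)
```

To try it on a real file, pass the contents of the decompressed PDF, read in binary mode, to `summarize_fonts`. The scan only counts references; it does not measure how large each font program is.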

    Another situation in which you might see this non-linearity is a big background image on each page: it appears only once in the original file and is shared between pages.
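
The effect of shared resources can be illustrated with a simple two-part model: file size = shared part + page count × per-page part. This is only a back-of-the-envelope sketch. It assumes every page contributes the same amount and that a single-page extract keeps all of the shared data (fonts, images); neither holds exactly for a real file. The figures (192 KB, 84 KB, 12 pages) are the ones quoted above, and the helper name is made up.

```python
def split_shared_and_per_page(total_size, single_page_size, page_count):
    """Solve  total  = shared + page_count * per_page
              single = shared + per_page
    for (shared, per_page). Sizes can be in any consistent unit."""
    per_page = (total_size - single_page_size) / (page_count - 1)
    shared = single_page_size - per_page
    return shared, per_page


# Figures quoted in the answer: 12-page original of 192 KB, page 1 alone 84 KB.
shared_kb, per_page_kb = split_shared_and_per_page(192, 84, 12)

print(f"shared (fonts etc.): about {shared_kb:.0f} KB")
print(f"per page:            about {per_page_kb:.0f} KB")
```

Under these assumptions most of the single-page file is the shared part, which is why extracting one page saves far less than eleven twelfths of the size.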

    You can remove the embedded fonts with cpdf, but it's really not a good idea!

    $ cpdf -remove-fonts out2.pdf -o out3.pdf
    

    Now you're down to 18 KB. But Adobe Reader won't display the text. macOS Preview will, after a fashion.

    CodePudding user response:

    Your sample files are in a slightly problematic style: LaTeX generation often runs words together, so the text extracted from page 17 can look like this:

    2
    p
     1
    ≤
    k
    ≤
    2
    n
    −
    1.Altogether,weobtainthebound
    s
    p
    k
    ≤
    m
    k
     
    m
    k
    −
    2
    p
    −
    1
    −
    1
    ,
    2
    p
     1
    ≤
    k
    ≤
    2
    n.
    Toprove(B),wewillusethenotation[
    ω
    p
     1
    ]
    i
    todenotethemap[
    ω
    p
     1
    ]
    i
    :
    H
    i
    (
    M
    )
    →
    H
    i
     2
    p
     2
    (
    M
    ).Itthenfollowsfrom(3.7)that
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
    s
    p
    i
    =
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
     
    dimcoker[
    ω
    p
     1
    ]
    i
    −
    2
    p
    −
    2
     dimker[
    ω
    p
     1
    ]
    i
    −
    2
    p
    −
    1
     
    =
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
     
    dimcoker[
    ω
    p
     1
    ]
    i
    −
    2
    p
    −
    2
    −
    dimker[
    ω
    p
     1
    ]
    i
    −
    2
    p
    −
    2
     
     dimker[
    ω
    p
     1
    ]
    k
    −
    2
    p
    −
    1
    =
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
    (dim
    H
    i
    (
    M
    )
    −
    dim
    H
    i
    −
    2
    p
    −
    2
    (
    M
    )) dimker[
    ω
    p
     1
    ]
    k
    −
    2
    p
    −
    1
    (3.13)
    whereinthelastline,wehaveusedtherank-nullityrelation,
    dimcoker[
    ω
    p
     1
    ]
    i
    −
    2
    p
    −
    2
    −
    dimker[
    ω
    p
     1
    ]
    i
    −
    2
    p
    −
    2
    =dim
    H
    i
    (
    M
    )
    −
    dim
    H
    i
    −
    2
    p
    −
    2
    (
    M
    )
    .
    Nowwhen
    k<
    2
    p
     1,anumberofthetermsontheright-handsideof(3.13)trivially
    vanish.Hence,wefindfor
    k<
    2
    p
     1
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
    s
    p
    i
    =
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
    b
    i
    ≤
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
    m
    i
    =
    k
     
    i
    =
    k
    −
    2
    p
    (
    −
    1)
    k
    −
    i
    m
    i
    ,
    (3.14)
    havinginthemiddleappliedthestandardstrongMorseinequality(3.2).
    When2
    p
     1
    ≤
    k
    ,(3.13)simplifiesto
    k
     
    i
    =0
    (
    −
    1)
    k
    −
    i
    s
    p
    i
    =
    k
     
    i
    =
    k
    −
    2
    p
    −
    1
    (
    −
    1)
    k
    −
    i
    b
    i
     dimker[
    ω
    p
     1
    ]
    k
    −
    2
    p
    −
    1
    ≤
    k
     
    i
    =
    k
    −
    2
    p
    (
    −
    1)
    k
    −
    i
    b
    i
    (3.15)
    Moreover,if
    k
    isoddand2
    p
     1
    ≤
    (
    k
    =2
    j
     1)
    ≤
    2
    n
    −
    1,then[
    ω
    j
    −
    p
    ]
    ∈
    H
    k
    −
    2
    p
    −
    1
    isnotinthe
    17
    
    Every printout will be different. Taking just one example, 2211.11712 (391 KB), and using MS Print to PDF, I get these results for page 17 (102 KB) and page 24 (72 KB). That raises the question of why my page 24 is smaller than my page 17 while yours are the other way round. The answer is that it comes down to the PDF generator and how it handles images and embedded fonts.


    My outputs turn out to be images without any fonts, so they are inferior to yours, which presumably have readable text.

    Switching to a different output method gives different results: page 17 is now only 97 KB and page 24 is 74 KB. These are not images, but they are not text either (they are vectors, i.e. simple paths), so they are still not searchable.


    Thus every combination of input and output will give a different outcome, but for the text to be searchable the fonts need to be included, either fully embedded or subset.

    Size is not really a problem: smaller is certainly not better, and compressing a file usually leads to a degraded result.

    However, to contradict that last comment, I was surprised to find that the smallest output is the best in this case: Firefox's print to PDF, which retains the searchable text.


    File sizes: 44 KB and 29 KB (wow).


    For comparison, 2211.11725-Page1 from Firefox is 64.1 KB (65,660 bytes), so it is also highly optimised. However, the hyperlink on the left is discarded, along with others that are naturally dropped.
