Can I bulk-remove links from a pdf from the command line?

Time:09-04

I'm downloading some newspapers as PDFs (for posterity). One title is a pain: it includes URI links in the PDF itself, and if you accidentally click one it opens a browser tab to a page that returns a 500 error. It's not so bad on a desktop computer, but a pain in the butt if someone is reading it on a tablet. Each issue has approximately 200 of these links.

For a different title, it was as simple as using QPDF, like so:

qpdf --qdf --object-streams=disable file temp-file

This puts the temp version into QDF mode (an uncompressed form of the PDF that can be edited as text), and I was able to nuke the links with a Perl substitution something like this:

s/obj\n<<\n(  \/A <<\n    \/S \/URI.+?)>>\nendobj/"obj\n<<\n" . " " x length($1). ">>\nendobj"/sge

This still works. However, a 15 meg original pdf is now becoming a 108 meg "fixed" pdf. I can accept some bloat, but ending up at 720% of the original size is a bit absurd (I think the increase was more like 10% on the other title). Whenever I google for how to do this, I get results for Acrobat Reader and how you can click around in 20 menus to do it... does no one who uses Adobe products ever want to automate this stuff? There are between 180 and 300 links in a typical issue, spread across 45-150 pages (Sunday editions).

Are there any tools that can do this? Are there any clever arguments to qpdf that will make this more reasonable?

PS Yes I know it's hacky as hell to just overwrite the URIs with spaces, but I've never managed to figure out how to remove the objects entirely since their references also have to be removed.
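
For reference, here is a minimal Ruby sketch of the same-length blanking trick from the substitution above. It assumes the file has already been converted to QDF mode and that the objects are laid out exactly as the regex expects; the names URI_ACTION and blank_uri_actions are made up for illustration. Padding with spaces keeps every byte offset in the file unchanged, which is why the hack works without touching any references.

```ruby
# Sketch only: blank out objects that start with a /URI action in QDF-mode
# PDF data. Assumes the QDF layout shown in the question's regex.
URI_ACTION = /obj\n<<\n(  \/A <<\n    \/S \/URI.+?)>>\nendobj/m

# Hypothetical helper: replaces the body of each matching object with the
# same number of spaces, so the overall byte length does not change.
def blank_uri_actions(data)
  data.gsub(URI_ACTION) do
    "obj\n<<\n" + (" " * $1.bytesize) + ">>\nendobj"
  end
end

# Usage: ruby blank_links.rb temp-file blanked-file
if __FILE__ == $0 && ARGV.size == 2
  src, dst = ARGV
  # Read and write in binary mode so stream data is left untouched.
  File.binwrite(dst, blank_uri_actions(File.binread(src)))
end
```

This does the same thing as the Perl one-liner and nothing more, so it has the same size problem: the output is still the uncompressed QDF file.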

CodePudding user response:

You can use HexaPDF (you need to have Ruby installed and then use gem install hexapdf to install HexaPDF) and the following small script to remove the links:

require 'hexapdf'

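HexaPDF::Document.open(ARGV[0]) do |doc|
  doc.pages.each do |page|
    # Collect the link annotations first, then remove them from the page
    page.each_annotation.select {|annot| annot[:Subtype] == :Link}.each do |annot|
      page[:Annots].delete(annot)
    end
  end
  # Write the result next to the input, e.g. file.pdf_processed.pdf
  doc.write(ARGV[0] + '_processed.pdf', optimize: true)
end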

Then run the script in a batch over all the files you want the links removed from.

Note that this will remove all links, including internal ones, not just the URI links.
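
If you only want to drop the URI links and process several files in one go, a variation along these lines might work. This is a sketch, not tested against your newspapers: the helpers uri_link? and output_name and the _nolinks suffix are made up for illustration, and it assumes each web link stores its target in an /A action dictionary with /S /URI, as in the question.

```ruby
# Hypothetical helper: true only for link annotations whose action is a
# URI action. Works on anything that supports [] lookups by symbol.
def uri_link?(annot)
  action = annot[:A]
  annot[:Subtype] == :Link && !action.nil? && action[:S] == :URI
end

# Hypothetical naming scheme: "issue.pdf" becomes "issue_nolinks.pdf".
def output_name(path)
  path.sub(/\.pdf\z/i, '') + '_nolinks.pdf'
end

# Usage: ruby remove_uri_links.rb issue1.pdf issue2.pdf ...
if __FILE__ == $0 && !ARGV.empty?
  require 'hexapdf' # loaded here so the helpers above can be tried without the gem
  ARGV.each do |path|
    HexaPDF::Document.open(path) do |doc|
      doc.pages.each do |page|
        # select builds an array first, so deleting while looping is safe
        page.each_annotation.select { |annot| uri_link?(annot) }.each do |annot|
          page[:Annots].delete(annot)
        end
      end
      doc.write(output_name(path), optimize: true)
    end
  end
end
```

Links that point to a destination inside the document have no URI action, so this version leaves them alone.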

CodePudding user response:

You can do this with the community edition of cpdf: https://community.coherentpdf.com/

To remove all links in a PDF (well, to replace them with an empty link):

cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '""' -o out.pdf

This does not remove the annotations: it leaves each one in place but with an empty link, so clicking on it won't go anywhere. You could replace it with a working URL too, of course:

cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '"https://www.google.com/"' -o out.pdf

(You can also use -replace-dict-entry-search to replace only certain URLs - see the manual.)

Or, if you just want rid of all the annotations (link and non-link):

cpdf -remove-annotations in.pdf -o out.pdf