How to parse a PDF format web page using Python and print its content in a pretty format?

Time:09-17

I want to parse a web page that is really a PDF file, using Python. Below is the link to a sample PDF web page:

With Poppler utils, the command line is:

pdftohtml -f 1 -l 2 -fmt png -p -c http://www.jsu.edu/ire/factbook/JSUFactbook14-15.pdf index.htm
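Since the question asks about Python, here is a minimal sketch of driving that same command from a script. The function names `build_pdftohtml_cmd` and `run_pdftohtml` are made up for illustration, and the sketch assumes `pdftohtml` is on the PATH and that your build accepts a URL as input, as in the command above.

```python
import shutil
import subprocess


def build_pdftohtml_cmd(source, out_name, first_page=1, last_page=2):
    """Return the argument list for the pdftohtml command shown above."""
    return [
        "pdftohtml",
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-fmt", "png",           # image format for page backgrounds
        "-p",                    # exchange .pdf links by .html
        "-c",                    # generate a complex (layout-preserving) document
        source,
        out_name,
    ]


def run_pdftohtml(source, out_name, first_page=1, last_page=2):
    """Run pdftohtml and raise if the tool is missing or exits with an error."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml (Poppler utils) was not found on PATH")
    cmd = build_pdftohtml_cmd(source, out_name, first_page, last_page)
    subprocess.run(cmd, check=True)


# Example call (not executed here, it needs Poppler and network access):
# run_pdftohtml("http://www.jsu.edu/ire/factbook/JSUFactbook14-15.pdf", "index.htm")
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` turns a non-zero exit code into a Python exception.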

Don't expect it to be fast: it has to download the whole file to find and sort all the random objects on each page.

For example, if you are searching for the first "Jacksonville State University", it is in the first half of object number 6,855, directly above the word "Book", which is part of the same object. So both lines were either inserted as one or later merged into page 1, as is often the case when the cover is designed and added later using InDesign.

Once the file is downloaded, decrypted and sorted by pdftohtml, it can start composing an HTML file for each page and add bookmarks for those pages. That is a slow process, and not much quicker if you parse only page one by setting -l 1 in place of -l 2.
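Because the whole file has to be fetched whichever pages you ask for, one option is to download it once and work on the local copy. The sketch below does that and then prints plain text with `pdftotext -layout`, another Poppler tool; this is a swapped-in alternative to the pdftohtml route above, not part of the original command. The helper names `download_once` and `page_text` and the local file name are hypothetical, and it assumes `pdftotext` is on the PATH.

```python
import os
import shutil
import subprocess
import urllib.request


def download_once(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as fh:
            shutil.copyfileobj(resp, fh)
    return dest


def page_text(pdf_path, first_page, last_page):
    """Return the text of a page range, keeping the physical layout."""
    cmd = [
        "pdftotext",
        "-f", str(first_page),
        "-l", str(last_page),
        "-layout",       # keep columns and spacing roughly as on the page
        pdf_path,
        "-",             # write the text to stdout instead of a file
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, encoding="utf-8")
    return result.stdout


# Example (not executed here, it needs Poppler and network access):
# pdf = download_once("http://www.jsu.edu/ire/factbook/JSUFactbook14-15.pdf",
#                     "factbook.pdf")
# print(page_text(pdf, 1, 2))
```

This only avoids repeating the download on later runs; the parsing itself still has to locate and sort the objects on each requested page.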
