I want to parse a web page that is really a PDF file using Python. Below is the link to a sample PDF web page:
With poppler-utils, the command line is:
pdftohtml -f 1 -l 2 -fmt png -p -c http://www.jsu.edu/ire/factbook/JSUFactbook14-15.pdf index.htm
Don't expect it to be fast: it has to download the whole file in order to find and sort all the random objects on each page.
For example, if you are searching for that "first" Jacksonville State University, it is in the first half of object number 6,855 and may be found above the word "Book", which is part of the same object. So both lines were either inserted as one or later merged into page 1, as is often the case when the cover is designed and added later using InDesign.
Once the file is downloaded, decrypted and sorted by pdftohtml, it can start composing an HTML file for each page and add bookmarks for those pages. That is a slow process, and not much quicker if you only parse page one by setting -l 1 in place of -l 2.
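Since the question asks for Python, here is a minimal sketch of driving the same command from a script. It assumes pdftohtml from poppler-utils is installed and on the PATH, and it downloads the PDF to a local file first rather than passing the URL to the tool. The function names, the `source.pdf` file name and the `work_dir` parameter are made up for illustration; the flags are the ones from the command above.

```python
import subprocess
import urllib.request
from pathlib import Path


def build_pdftohtml_command(pdf_path, output_name, first_page=1, last_page=2):
    """Return the pdftohtml argument list for the given page range.

    Uses the same flags as the command line above: -f/-l for the page
    range, -fmt png for images, -p and -c as in the original command.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("page range must satisfy 1 <= first_page <= last_page")
    return [
        "pdftohtml",
        "-f", str(first_page),
        "-l", str(last_page),
        "-fmt", "png",
        "-p",
        "-c",
        str(pdf_path),
        str(output_name),
    ]


def convert_pdf_url(url, work_dir=".", first_page=1, last_page=2):
    """Download the PDF at `url`, run pdftohtml on it, return the HTML files.

    Hypothetical helper: the local file name and the glob pattern below
    are assumptions, not something pdftohtml prescribes.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # The whole file has to be fetched before any page can be extracted.
    pdf_path = work_dir / "source.pdf"
    urllib.request.urlretrieve(url, str(pdf_path))

    command = build_pdftohtml_command(
        pdf_path, work_dir / "index.htm", first_page, last_page
    )
    # check=True raises CalledProcessError if pdftohtml exits with an error.
    subprocess.run(command, check=True)

    # Output file names can vary, so collect whatever HTML was written.
    return sorted(work_dir.glob("*.htm*"))
```

Calling `convert_pdf_url("http://www.jsu.edu/ire/factbook/JSUFactbook14-15.pdf", "out")` should leave the generated pages in `out/`, which you can then parse with whatever HTML parser you prefer. Keeping the command builder separate from the download makes it easy to check the arguments without touching the network.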