Home > database >  Using Selenium to return which page of the PDF is being displayed
Using Selenium to return which page of the PDF is being displayed

Time:09-07

I have Selenium opening many pdfs for me from Google Search (using f"https://www.google.com/search?q=filetype:pdf {search_term}" and then clicking on the first link)

I want to know which pages contain my keyword WITHOUT downloading the pdf first. I believe I can use

Ctrl F --> keyword --> {scrape page number} --> Tab (next keyword) --> {scrape page number} --> ... --> switch to next PDF

How can I accomplish the {scrape page number} part?

Context

For each PDF I need to grab these numbers as a list or in a Pandas DataFrame or anything I can use to feed in camelot.read_pdf() later

The idea is also once I have these page numbers, I can selectively download pages of these pdfs and save on storage, memory and network speeds rather than downloading and parsing the entire pdf

Using BeautifulSoup

PDFs have a small gray box at the top with the current page number and total pages number with the option to skip around the PDF.


<input data-element-focusable="true" id="pageselector"  type="text" value="151" title="Page number (Ctrl Alt G)" aria-label="Go to any page between 1 and 216">

The value in this input tag contains the number I am looking for.

Other SO answers

I'm aware that reading PDFs programatically is challenging and I'm currently using this function (enter image description here

It does not matter which browser extension you are using they all hold the file somewhere in your file system, note the difference here the data says its on the web but the edit message show otherwise. In this case the field is secured outside the browser however Ctrl D C gives me

File: https://africau.edu/images/default/sample.pdf
Created: 3/1/2006 7:28:26 AM
Application: Rave (http://www.nevrona.com/rave)
PDF Producer: Nevrona Designs
PDF Version: 1.3
File Size: 2.96 KB (3,028 Bytes)
Number of Pages: 2
Page Size: 8.5 x 11.0 in (Letter)
   
Fonts: Helvetica (Type1; Ansi)

enter image description here

Mozilla PDF.js is a different beast, so may be more addressable, but as you found you can use a hybrid approach in the index.htm of Chrome/Edge you could equally do that offline.

So on the basis you have scraped a list of URLs the simplest solution should be
curl -O (or -o tmp.pdf) URL & pdftotext | find "Keyword"

you will need to adapt that a bit to show page and line number but that's a different question or two
https://stackoverflow.com/a/72440765/10802527
https://stackoverflow.com/a/72778117/10802527

  • Related