Scraping PDFs from page containing multiple search results

Time: 07-23

I am interested in scraping the PDFs for any of the speakers listed on this page. How might I go about it? https://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker=Amy Khor

The website has changed since I last scraped it, and this is the code I used previously:


import requests
from bs4 import BeautifulSoup

url = 'http://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker='

search_term = 'Amy Khor'

data = {
    'keywords': search_term,
    'search-type': 'basic',
    'keywords-type': 'all',
    'page-num': 1
}

soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')

cnt = 1
while True:

    print()
    print('Page no. {}'.format(cnt))
    print('-' * 80)

    for a in soup.select('a[href$=".pdf"]'):
        print(a['href'])

    if soup.select_one('span.next-10'):
        data['page-num'] += 10
        cnt += 1
        soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
    else:
        break

The code above no longer works...

CodePudding user response:

Here's how I'd do it if I were to start from scratch.

Google Search is actually pretty powerful, and this query should find your PDFs:

"Amy Khor" site:https://www.nas.gov.sg/archivesonline/data/pdfdoc filetype:pdf

Then I'd use either BeautifulSoup or, even better, something like googlesearch-python to collect the result URLs and process them into your desired format.
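A minimal sketch of this approach. The `googlesearch-python` call is shown only in comments because it needs network access, and its exact signature is an assumption about that library; the `is_nas_pdf` filter helper is pure and can be tested locally:

```python
def is_nas_pdf(url: str) -> bool:
    """Return True for direct PDF links under the NAS archives PDF directory."""
    return (
        url.startswith("https://www.nas.gov.sg/archivesonline/data/pdfdoc")
        and url.lower().endswith(".pdf")
    )

# Hypothetical usage with googlesearch-python (pip install googlesearch-python):
# from googlesearch import search
# query = '"Amy Khor" site:https://www.nas.gov.sg/archivesonline/data/pdfdoc filetype:pdf'
# pdf_links = [u for u in search(query, num_results=50) if is_nas_pdf(u)]

# Demonstration on sample URLs (no network needed):
candidates = [
    "https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf",
    "https://www.nas.gov.sg/archivesonline/speeches/search-result",
]
print([u for u in candidates if is_nas_pdf(u)])
# -> ['https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf']
```

The filter guards against Google returning search-result pages or other non-PDF hits mixed in with the direct document links.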

CodePudding user response:

To get all PDF links from the result pages you can use the following example:

import requests
from bs4 import BeautifulSoup

url = "https://www.nas.gov.sg/archivesonline/speeches/search-result"

params = {
    "search-type": "advanced",
    "speaker": "Amy Khor",
    "page-num": "1",
}

for params["page-num"] in range(1, 3):    # <--- increase number of pages here
    soup = BeautifulSoup(
        requests.get(url, params=params).content, "html.parser"
    )
    for a in soup.select('a[href$="pdf"]'):
        print("https:"   a["href"])
    print("-" * 80)

Prints:

https://www.nas.gov.sg/archivesonline/data/pdfdoc/MINDEF_20171123001_2.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20160229002.pdf

...and so on.
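Once you have the links, actually saving the PDFs is one more step. A sketch using only the standard library (the download function requires network access, so only the file-name helper is exercised here; swap `urlretrieve` for `requests.get` if you prefer):

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve  # stdlib alternative to requests


def pdf_filename(url: str) -> str:
    """Derive a local file name from the URL's path component."""
    return os.path.basename(urlparse(url).path)


def download_pdfs(links, dest="pdfs"):
    """Download each PDF link into dest/ (needs network; not run here)."""
    os.makedirs(dest, exist_ok=True)
    for url in links:
        urlretrieve(url, os.path.join(dest, pdf_filename(url)))


print(pdf_filename(
    "https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf"
))
# -> MSE_20151126001.pdf
```

In the scraping loop above you would collect `"https:" + a["href"]` into a list and pass it to `download_pdfs` instead of printing.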