I am interested in scraping the PDFs for any of the speakers on this site. How might I go about this? Example search page: https://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker=Amy Khor
The website has changed since I last scraped it. Previously I used code like this:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker='
search_term = 'Amy Khor'
# Form fields sent with each POST request
data = {
    'keywords': search_term,
    'search-type': 'basic',
    'keywords-type': 'all',
    'page-num': 1
}
soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
cnt = 1
while True:
    print()
    print('Page no. {}'.format(cnt))
    print('-' * 80)
    # Print every link whose href ends in ".pdf"
    for a in soup.select('a[href$=".pdf"]'):
        print(a['href'])
    # Keep going while the page offers a "next 10" control
    if soup.select_one('span.next-10'):
        data['page-num'] += 10
        cnt += 1
        soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
    else:
        break
The code above no longer works...
CodePudding user response:
Here's how I'd do it if I were to start from scratch.
Google Search is actually pretty powerful, and I feel like this query gets your pdfs:
"Amy Khor" site:https://www.nas.gov.sg/archivesonline/data/pdfdoc filetype:pdf
Then I'd use either BeautifulSoup or, even better, something like googlesearch-python to fetch the results and collect the PDF links from them.
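As a rough sketch of that idea, the two helpers below build the same query for any speaker and filter a list of result URLs down to PDFs. The names build_query and keep_pdf_urls are made up for illustration, and the code deliberately does not call any search package itself, so it makes no claim about how a particular library behaves.

```python
def build_query(speaker, site="https://www.nas.gov.sg/archivesonline/data/pdfdoc"):
    """Compose the search query shown above for any speaker name."""
    return f'"{speaker}" site:{site} filetype:pdf'


def keep_pdf_urls(urls):
    """Keep only URLs that point at a PDF, preserving order and dropping repeats."""
    seen = set()
    kept = []
    for url in urls:
        path = url.split("?")[0]  # ignore any query string when checking the suffix
        if path.lower().endswith(".pdf") and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


print(build_query("Amy Khor"))
```

If googlesearch-python is installed, the result URLs would presumably come from its search function and then be passed through keep_pdf_urls; check that package's documentation for the exact call, since it is not shown in this thread.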
CodePudding user response:
To get all PDF links from the result pages, you can use the following example:
import requests
from bs4 import BeautifulSoup

url = "https://www.nas.gov.sg/archivesonline/speeches/search-result"
params = {
    "search-type": "advanced",
    "speaker": "Amy Khor",
    "page-num": "1",
}
# The loop variable writes straight into params["page-num"] on each pass
for params["page-num"] in range(1, 3):  # <--- increase number of pages here
    soup = BeautifulSoup(
        requests.get(url, params=params).content, "html.parser"
    )
    # Prefix the scheme to each href that ends in "pdf"
    for a in soup.select('a[href$="pdf"]'):
        print("https:" + a["href"])
    print("-" * 80)
Prints:
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MINDEF_20171123001_2.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20160229002.pdf
...and so on.
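If you also want to stop paging automatically and save the files, something along these lines might work. It is only a sketch under assumptions: the helper names (absolute_url, collect_pdf_links, save_pdfs) are invented here, and it assumes that a page past the last one yields no new PDF links, which has not been verified against the site. The page fetching and downloading are passed in as callables, so the block itself needs only the standard library.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urljoin, urlparse

BASE = "https://www.nas.gov.sg/archivesonline/speeches/search-result"


def absolute_url(href, base=BASE):
    """Resolve relative or protocol-relative hrefs ("//host/x.pdf") against base."""
    return urljoin(base, href)


def collect_pdf_links(hrefs_for_page, max_pages=50):
    """Walk pages 1..max_pages and stop at the first page that adds nothing new.

    hrefs_for_page is a hypothetical callable: page number -> list of hrefs.
    """
    links = []
    for page in range(1, max_pages + 1):
        added = 0
        for href in hrefs_for_page(page):
            url = absolute_url(href)
            if url not in links:
                links.append(url)
                added += 1
        if added == 0:  # assumed end-of-results signal
            break
    return links


def save_pdfs(urls, dest_dir, fetch):
    """Save each URL under dest_dir, named after the last path segment.

    fetch is a callable: URL -> bytes.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = dest / PurePosixPath(urlparse(url).path).name
        if not target.exists():  # skip files that were already downloaded
            target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

With the code above, hrefs_for_page could wrap the requests.get(url, params=params) call plus soup.select('a[href$="pdf"]'), and fetch could be lambda u: requests.get(u, timeout=30).content. Using urljoin instead of prefixing "https:" by hand also copes with hrefs that are already absolute.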