I am trying to scrape this link. As an example, I just want to scrape the first page: I would like to collect the title and author of each of the 10 links on the first page.
To gather titles and authors, I wrote the following code:
from bs4 import BeautifulSoup
import requests
import numpy as np
url = 'https://www.bis.org/cbspeeches/index.htm?m=1123'
r = BeautifulSoup(requests.get(url).content, features = "lxml")
r.select('#cbspeeches_list a') # '#cbspeeches_list a' got via SelectorGadget
However, I get an empty list. What am I doing wrong?
Thanks!
CodePudding user response:
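The data is loaded from an external source by an API call that uses the POST method, so the list is not in the HTML of the page you requested. You just have to send the request to the API URL instead.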
from bs4 import BeautifulSoup
import requests
payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url= 'https://www.bis.org/doclist/cbspeeches.htm'
headers = {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
}
req=requests.post(url,headers=headers,data=payload)
print(req)
soup = BeautifulSoup(req.content, "lxml")
data=[]
for card in soup.select('.documentList tbody tr'):
    title = card.select_one('.title a').get_text()
    author = card.select_one('.authorlnk.dashed').get_text().strip()
    data.append({
        'title': title,
        'author': author
    })
print(data)
Output
[{'title': 'Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022', 'author': '\nPablo Hernández de Cos'},
 {'title': 'Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank ', 'author': '\nKlaas Knot'},
 {'title': 'Luis de Guindos: Challenges for monetary policy', 'author': '\nLuis de Guindos'},
 {'title': 'Fabio Panetta: Europe as a common shield - protecting the euro area economy from global shocks', 'author': '\nFabio Panetta'},
 {'title': 'Victoria Cleland: Rowing in unison to enhance cross-border payments', 'author': '\nVictoria Cleland'},
 {'title': 'Yaron Amir: A look at the future world of payments - trends, the market, and regulation', 'author': '\nYaron Amir'},
 {'title': 'Ásgeir Jónsson: Speech – 61st Annual Meeting of the Central Bank of Iceland', 'author': '\nÁsgeir Jónsson'},
 {'title': 'Lesetja Kganyago: Project Khokha 2 report launch', 'author': '\nLesetja Kganyago'},
 {'title': 'Huw Pill: What did the monetarists ever do for us?', 'author': '\nHuw Pill'},
 {'title': 'Shaktikanta Das: Inaugural address - Statistics Day Conference ', 'author': '\nShaktikanta Das'}]
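For what it is worth, the payload string in this answer is an ordinary URL-encoded form body, so it can also be built from a dictionary rather than typed by hand. The sketch below uses only the standard library. build_payload is a hypothetical helper name, and leaving page empty simply mirrors the payload above; how the endpoint interprets other page values has not been checked here.

```python
from urllib.parse import urlencode


def build_payload(paging_length=10, page=""):
    """Return the form body as a URL-encoded string (hypothetical helper)."""
    fields = {
        "from": "",
        "till": "",
        "objid": "cbspeeches",
        "page": page,  # left empty, as in the payload used in the answer
        "paging_length": paging_length,  # number of rows requested
        "sort_list": "date_desc",
        "theme": "cbspeeches",
        "ml": "false",
        "mlurl": "",
        "emptylisttext": "",
    }
    # urlencode keeps the dict order and encodes empty values as "key="
    return urlencode(fields)


payload = build_payload()
print(payload)
```

Passing a dict directly as data= to requests.post, as the next answer does, makes requests perform the same encoding.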
CodePudding user response:
Try this:
from bs4 import BeautifulSoup
import requests
data = {
    'from': '',
    'till': '',
    'objid': 'cbspeeches',
    'page': '',
    'paging_length': '25',
    'sort_list': 'date_desc',
    'theme': 'cbspeeches',
    'ml': 'false',
    'mlurl': '',
    'emptylisttext': ''
}
response = requests.post('https://www.bis.org/doclist/cbspeeches.htm', data=data)
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(response.content, "lxml")
for elem in soup.find_all("tr"):
    # the title
    print(elem.find("a").text)
    # the author
    print(elem.find("a", class_="authorlnk dashed").text)
    print()
Prints out:
Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022
Pablo Hernández de Cos
Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank
Klaas Knot
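If some rows turn out not to contain both links, the loop above would fail with an AttributeError on None. A minimal sketch of a more defensive version follows. It runs on a made-up HTML fragment (SAMPLE_HTML is an assumption shaped after the selectors used in these answers, not the real response), and extract_speeches is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only; the real response may differ.
SAMPLE_HTML = """
<table class="documentList">
  <tbody>
    <tr>
      <td>
        <div class="title"><a href="#">Jane Doe: An example speech</a></div>
        <a class="authorlnk dashed" href="#">
Jane Doe</a>
      </td>
    </tr>
    <tr><td>row without links</td></tr>
  </tbody>
</table>
"""


def extract_speeches(html):
    """Return a list of {'title', 'author'} dicts, skipping incomplete rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr"):
        title_link = tr.select_one(".title a")
        author_link = tr.select_one(".authorlnk.dashed")
        if title_link is None or author_link is None:
            continue  # skip rows that lack either element
        rows.append({
            "title": title_link.get_text(strip=True),
            "author": author_link.get_text(strip=True),  # drops the leading newline
        })
    return rows


speeches = extract_speeches(SAMPLE_HTML)
print(speeches)
```

With a live request, response.content from the POST call would take the place of SAMPLE_HTML.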