Failing to scrape titles and authors from a website with multiple links

I am trying to scrape this link. As an example, I just want to scrape the first page: I would like to collect the title and author for each of the 10 links on the first page.

To gather titles and authors, I wrote the following code:

from bs4 import BeautifulSoup
import requests
import numpy as np

url = 'https://www.bis.org/cbspeeches/index.htm?m=1123'
  
r = BeautifulSoup(requests.get(url).content, features = "lxml")
r.select('#cbspeeches_list a') # '#cbspeeches_list a' got via SelectorGadget

However, I get an empty list. What am I doing wrong?

Thanks!

CodePudding user response：

The data is loaded from an external source through an API, using a POST request, so it is not in the HTML returned for the page URL. You just have to send the request to the API URL instead.

from bs4 import BeautifulSoup
import requests

# Form-encoded body for the POST request
payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url = 'https://www.bis.org/doclist/cbspeeches.htm'
headers = {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
}

req = requests.post(url, headers=headers, data=payload)
print(req)  # shows the response status, e.g. <Response [200]>
soup = BeautifulSoup(req.content, "lxml")

data = []
# Each table row holds one speech
for card in soup.select('.documentList tbody tr'):
    title = card.select_one('.title a').get_text()
    author = card.select_one('.authorlnk.dashed').get_text().strip()
    data.append({
        'title': title,
        'author': author
    })

print(data)

Output

[{'title': 'Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022', 'author': '\nPablo Hernández de Cos'},
 {'title': 'Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank ', 'author': '\nKlaas Knot'},
 {'title': 'Luis de Guindos: Challenges for monetary policy', 'author': '\nLuis de Guindos'},
 {'title': 'Fabio Panetta: Europe as a common shield -  protecting the euro area economy from global shocks', 'author': '\nFabio Panetta'},
 {'title': 'Victoria Cleland: Rowing in unison to enhance cross-border payments', 'author': '\nVictoria Cleland'},
 {'title': 'Yaron Amir: A look at the future world of payments - trends, the market, and regulation', 'author': '\nYaron Amir'},
 {'title': 'Ásgeir Jónsson: Speech – 61st Annual Meeting of the Central Bank of Iceland', 'author': '\nÁsgeir Jónsson'},
 {'title': 'Lesetja Kganyago: Project Khokha 2 report launch', 'author': '\nLesetja Kganyago'},
 {'title': 'Huw Pill: What did the monetarists ever do for us?', 'author': '\nHuw Pill'},
 {'title': 'Shaktikanta Das: Inaugural address - Statistics Day Conference ', 'author': '\nShaktikanta Das'}]
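
The row-parsing step can also be tried offline, without sending any request. Below is a minimal sketch, assuming markup shaped like the selectors in this answer suggest. The sample HTML, the name Jane Doe and the helper parse_rows are made up for illustration; the real response may be structured differently. It uses the built-in "html.parser" so that lxml is not required, and it skips rows where either element is missing instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: it only mimics the class names used by the
# selectors in this answer; the real response may be structured differently.
SAMPLE_HTML = """
<table class="documentList">
  <tbody>
    <tr>
      <td class="item_date">01 Jul 2022</td>
      <td>
        <div class="title"><a href="/review/r1.htm">Jane Doe: An example speech</a></div>
        <a class="authorlnk dashed" href="/author/jane_doe.htm">
Jane Doe</a>
      </td>
    </tr>
    <tr><td>row without title or author</td></tr>
  </tbody>
</table>
"""


def parse_rows(html):
    """Return a list of {'title', 'author'} dicts, skipping incomplete rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(".documentList tbody tr"):
        title = card.select_one(".title a")
        author = card.select_one(".authorlnk.dashed")
        if title is None or author is None:
            continue  # select_one returns None when nothing matches
        rows.append({
            "title": title.get_text(strip=True),
            "author": author.get_text(strip=True),
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

With a live response, the same function could be called as parse_rows(req.content).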

CodePudding user response：

Try this:

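import requests
from bs4 import BeautifulSoup

data = {
  'from': '',
  'till': '',
  'objid': 'cbspeeches',
  'page': '',
  'paging_length': '25',
  'sort_list': 'date_desc',
  'theme': 'cbspeeches',
  'ml': 'false',
  'mlurl': '',
  'emptylisttext': ''
}

response = requests.post('https://www.bis.org/doclist/cbspeeches.htm', data=data)

# Name the parser explicitly; without it BeautifulSoup emits a warning
soup = BeautifulSoup(response.content, "lxml")

for elem in soup.find_all("tr"):
    title = elem.find("a")
    author = elem.find("a", class_="authorlnk dashed")
    # Skip rows that lack either link (find returns None in that case)
    if title is None or author is None:
        continue
    # the title
    print(title.text)
    # the author
    print(author.text.strip())
    print()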

Prints out:

Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022
Pablo Hernández de Cos

Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank 
Klaas Knot
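
Since the question treats the first page only as an example, here is a rough sketch of how the form body could be built for several pages. The field names and values are taken from the answers above. The helper build_payload is hypothetical, and so is the assumption that the page field accepts a 1-based page number: the answers only ever send it empty, so this should be checked against the site before relying on it.

```python
from urllib.parse import urlencode

# Form fields copied from the answers above.
BASE_FIELDS = {
    "from": "",
    "till": "",
    "objid": "cbspeeches",
    "page": "",
    "paging_length": "10",
    "sort_list": "date_desc",
    "theme": "cbspeeches",
    "ml": "false",
    "mlurl": "",
    "emptylisttext": "",
}


def build_payload(page, page_size=10):
    """Build the form-encoded body for one page of results.

    ASSUMPTION: 'page' accepts a 1-based page number. The answers above only
    ever send it empty, so verify this before relying on it.
    """
    fields = dict(BASE_FIELDS, page=str(page), paging_length=str(page_size))
    return urlencode(fields)


payloads = [build_payload(p) for p in range(1, 4)]
print(payloads[0])
```

Each payload could then be passed as data= to requests.post, together with the headers from the first answer.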