Failing to webscrape titles and authors in a website with multi-links

Time:07-07

I am trying to scrape the page at the URL in the code below. As an example, I only want to scrape the first page: I would like to collect the title and author of each of the 10 links listed there.

To gather the titles and authors, I wrote the following code:

from bs4 import BeautifulSoup
import requests
import numpy as np

url = 'https://www.bis.org/cbspeeches/index.htm?m=1123'
  
r = BeautifulSoup(requests.get(url).content, features = "lxml")
r.select('#cbspeeches_list a') # '#cbspeeches_list a' got via SelectorGadget

However, I get an empty list. What am I doing wrong?

Thanks!

CodePudding user response:

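The data is loaded from an external source through an API, using a POST request, so it is not part of the HTML returned for the index page. You just have to send the request to the API URL instead.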

from bs4 import BeautifulSoup
import requests

# Form data for the POST request, already URL-encoded
payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url = 'https://www.bis.org/doclist/cbspeeches.htm'
headers = {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
}

# POST to the API endpoint instead of fetching the index page
req = requests.post(url, headers=headers, data=payload)
print(req)  # response object, useful for checking the status code
soup = BeautifulSoup(req.content, "lxml")

data = []
# Each row of the returned document list holds one speech
for card in soup.select('.documentList tbody tr'):
    title = card.select_one('.title a').get_text()
    author = card.select_one('.authorlnk.dashed').get_text().strip()
    data.append({
        'title': title,
        'author': author
    })

print(data)

Output

[{'title': 'Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022', 'author': '\nPablo Hernández de Cos'},
 {'title': 'Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank ', 'author': '\nKlaas Knot'},
 {'title': 'Luis de Guindos: Challenges for monetary policy', 'author': '\nLuis de Guindos'},
 {'title': 'Fabio Panetta: Europe as a common shield -  protecting the euro area economy from global shocks', 'author': '\nFabio Panetta'},
 {'title': 'Victoria Cleland: Rowing in unison to enhance cross-border payments', 'author': '\nVictoria Cleland'},
 {'title': 'Yaron Amir: A look at the future world of payments - trends, the market, and regulation', 'author': '\nYaron Amir'},
 {'title': 'Ásgeir Jónsson: Speech – 61st Annual Meeting of the Central Bank of Iceland', 'author': '\nÁsgeir Jónsson'},
 {'title': 'Lesetja Kganyago: Project Khokha 2 report launch', 'author': '\nLesetja Kganyago'},
 {'title': 'Huw Pill: What did the monetarists ever do for us?', 'author': '\nHuw Pill'},
 {'title': 'Shaktikanta Das: Inaugural address - Statistics Day Conference ', 'author': '\nShaktikanta Das'}]
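In the output shown above, the author values still carry a leading newline, some titles end with a space, and every title repeats the author's name as a prefix. If you want tidier records, a small post-processing step along these lines might help. This is only a sketch: clean_record is a hypothetical helper, not part of the answer, and it assumes the "Author: Title" pattern seen in the output holds for every entry.

```python
def clean_record(record):
    """Strip stray whitespace and drop the 'Author: ' prefix from the title.

    Hypothetical helper; assumes titles follow the 'Author: Title'
    pattern visible in the scraped output.
    """
    author = record["author"].strip()
    title = record["title"].strip()
    prefix = author + ": "
    # Only remove the prefix when the title really starts with the author name
    if title.startswith(prefix):
        title = title[len(prefix):]
    return {"title": title, "author": author}


# Two entries copied from the output above
sample = [
    {"title": "Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank ",
     "author": "\nKlaas Knot"},
    {"title": "Huw Pill: What did the monetarists ever do for us?",
     "author": "\nHuw Pill"},
]

cleaned = [clean_record(r) for r in sample]
print(cleaned)
```

In the scraping loop you could call it as data.append(clean_record({'title': title, 'author': author})).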


      

CodePudding user response:

Try this:

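from bs4 import BeautifulSoup
import requests

data = {
  'from': '',
  'till': '',
  'objid': 'cbspeeches',
  'page': '',
  'paging_length': '25',
  'sort_list': 'date_desc',
  'theme': 'cbspeeches',
  'ml': 'false',
  'mlurl': '',
  'emptylisttext': ''
}

response = requests.post('https://www.bis.org/doclist/cbspeeches.htm', data=data)

# Name the parser explicitly; BeautifulSoup warns when it has to guess
soup = BeautifulSoup(response.content, "lxml")

for elem in soup.find_all("tr"):
    title = elem.find("a")
    author = elem.find("a", class_="authorlnk dashed")
    # Skip rows missing either link, which would otherwise raise AttributeError
    if title is None or author is None:
        continue
    # the title
    print(title.text)
    # the author
    print(author.text)
    print()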

Prints out:

Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022
Pablo Hernández de Cos

Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank 
Klaas Knot
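
Note that both answers send the same form fields: the first as a pre-encoded string, the second as a dict that requests encodes for you. The sketch below shows the equivalence using only the standard library. build_payload is a hypothetical helper, and the idea that a non-empty page value selects a later page is an assumption based on the field name, not something either answer confirms.

```python
from urllib.parse import urlencode


def build_payload(page="", paging_length=10):
    """Return the form fields as a dict (hypothetical helper).

    Assumption: 'page' selects the result page and 'paging_length'
    the number of entries per page, as the field names suggest.
    """
    return {
        "from": "",
        "till": "",
        "objid": "cbspeeches",
        "page": str(page),
        "paging_length": str(paging_length),
        "sort_list": "date_desc",
        "theme": "cbspeeches",
        "ml": "false",
        "mlurl": "",
        "emptylisttext": "",
    }


fields = build_payload()
# urlencode produces the same string the first answer hard-codes
encoded = urlencode(fields)
print(encoded)
```

Passing the dict directly, as in requests.post(url, data=build_payload()), lets requests do this encoding and set the form content type itself.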