Why is BeautifulSoup leaving out parts of a website?


I'm completely new to Python and wanted to dip my toes into web scraping, so I tried to scrape the rankings of players at https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3. But when I try to access the rankings and ratings of each player, it returns None. The data is all inside the tbody, so I assume BeautifulSoup isn't able to access it because it's generated by JavaScript, but I'm not sure. Please help.

Input:

from bs4 import BeautifulSoup
import requests


URL_USAFencingOctoberNac_2022 = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3"
October_Nac_2022 = requests.get(URL_USAFencingOctoberNac_2022)
October_Nac_2022 = BeautifulSoup(October_Nac_2022.text, "html.parser")


tbody = October_Nac_2022.tbody
print(tbody)

Output:

None

CodePudding user response:

In this case the problem is not with BS4 but with the analysis done before starting to scrape. The data you are looking for is not returned by the request you made.

To get the data you have to make a request to a different back-end URL, https://www.fencingtimelive.com/events/competitors/data/F87F9E882BD6467FB9461F68E484B8B3?sort=name, which returns a JSON response.

The code will look something like this:

from requests import get

url = ('https://www.fencingtimelive.com/events/competitors/data/'
       'F87F9E882BD6467FB9461F68E484B8B3?sort=name')
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
    'X-Requested-With': 'XMLHttpRequest',
}
response = get(url, headers=headers)
print(response.json())
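Once you have the JSON, it behaves like an ordinary Python data structure. The field names below (`name`, `rating`) are assumptions for illustration only; inspect the actual `response.json()` output to see the real keys the site uses.

```python
import json

# Hypothetical payload mimicking a list-of-rows response shape;
# the real field names may differ -- check response.json() to confirm.
sample = json.loads(
    '[{"name": "Doe, Jane", "rating": "A22"},'
    ' {"name": "Smith, Alex", "rating": "B21"}]'
)

# Pull one column out of the list of row dicts
names = [row["name"] for row in sample]
print(names)  # ['Doe, Jane', 'Smith, Alex']
```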

If you want to test how BS4 performs, consider the example below, which fetches the blog post links from the Zyte blog:

from requests import get
from bs4 import BeautifulSoup as bs

url = "https://www.zyte.com/blog/"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'}
response = get(url, headers=headers)
soup = bs(response.content, "html.parser")
posts = soup.find_all('div', {"class": "oxy-posts"})
print(len(posts))
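To see `find_all` in action without a network request, here is a self-contained sketch using an inline HTML string; the markup is made up purely for illustration, reusing the `oxy-posts` class from the example above:

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML standing in for a fetched blog page
html = """
<div class="oxy-posts"><a href="/post-1">Post 1</a></div>
<div class="oxy-posts"><a href="/post-2">Post 2</a></div>
<div class="sidebar">Not a post</div>
"""

soup = bs(html, "html.parser")

# Matches only the <div> elements carrying the target class
posts = soup.find_all('div', {"class": "oxy-posts"})
print(len(posts))  # 2
```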

Note: Before writing code for scraping, analyse the website thoroughly. That will give you an idea of where the site actually gets its data from.

CodePudding user response:

One thing worth noting about your code: `October_Nac_2022.tbody` is just shorthand for `October_Nac_2022.find('tbody')`. Both return None here for the same reason: the raw HTML returned by the request contains no tbody element at all, because the table rows are filled in by JavaScript after the page loads, so there is nothing for Beautiful Soup to find.

Here is the equivalent code using the explicit find() call:

from bs4 import BeautifulSoup
import requests

URL_USAFencingOctoberNac_2022 = "https://www.fencingtimelive.com/events/competitors/F87F9E882BD6467FB9461F68E484B8B3"
October_Nac_2022 = requests.get(URL_USAFencingOctoberNac_2022)
October_Nac_2022 = BeautifulSoup(October_Nac_2022.text, "html.parser")

# Find the <tbody> element in the HTML
tbody = October_Nac_2022.find('tbody')

# Print the tbody element
print(tbody)

Note that this code uses the find() method of the Beautiful Soup object to search for the tbody element in the parsed HTML. This method returns the first matching element it finds, or None if no such element exists in the HTML.
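A quick self-contained illustration of that behaviour: the `.tbody` attribute shortcut and `find('tbody')` give the same result, and both return None when the element is absent from the fetched HTML (as happens when a table is populated by JavaScript). The HTML strings are made up for the demo.

```python
from bs4 import BeautifulSoup

# Page whose table body exists only after JavaScript runs --
# the raw HTML we fetched contains no <tbody> at all
raw_html = "<html><body><table id='results'></table></body></html>"
soup = BeautifulSoup(raw_html, "html.parser")

print(soup.tbody)          # None
print(soup.find('tbody'))  # None

# With the element present, both approaches return the same tag
full_html = "<table><tbody><tr><td>Doe, Jane</td></tr></tbody></table>"
soup2 = BeautifulSoup(full_html, "html.parser")
print(soup2.tbody is soup2.find('tbody'))  # True
```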
