I am writing a script in Python using the modules 'requests' and 'BeautifulSoup' to scrape results from football matches found in the links from the following page:
https://www.premierleague.com/results?co=1&se=363&cl=-1
The task consists of two steps (taking the first match, Arsenal against Brighton, as an example):
Extract and navigate to the href "https://www.premierleague.com/match/59266" found in the element:
div data-template-rendered data-href.Navigate or to the "Stats"-tab and extracting the information found in the element:
tbody class = "matchCentreStatsContainer".
I have already tried things like
page = requests.get("https://www.premierleague.com/match/59266")
soup = BeautifulSoup(page.text, "html.parser")
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
but I am not able to locate any of the elements in step 1) or 2) (empty list is returned).
CodePudding user response:
Instead of this:
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
Use this
soup.findAll({"class" : "matchCentreStatsContainer"})
It will work.
CodePudding user response:
In this case the problem is simply that you are looking for the wrong thing. There is no <div class="matchCentreStatsContainer">
on that page, that's a <tbody>
so it doesn't match. If you want the div, do:
divs = soup.find_all("div", class_="statsSection")
Otherwise search for the tbody
s:
soup.find_all("tbody", class_="matchCentreStatsContainer")
Incidentally the Right Way (TM) to match classes is with class_
, which takes either a list or a string (for a single class). This was added to bs4 a while back, but the old syntax is still floating around a lot.
Do note your first url as posted here is invalid: it needs a http:
or https:
before it.