I've written a function to try to get the names of authors and their respective links from a sandbox website (https://quotes.toscrape.com/); it should move on to the next page once every author on the current page has been covered.
It works for the first two pages but fails when moving on to the third with the error 'NoneType' object has no attribute 'find_all'.
Why would it break at the start of a new page when it has already moved between pages successfully?
Here's the function:
def AuthorLink(url):
    a = 0
    url = url
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    divRow = divContainer.find_all("div", class_="row")
    for result in divRow:
        divQuotes = result.find_all("div", class_="quote")
        for quotes in divQuotes:
            for el in quotes.find_all("small", class_="author"):
                print(el.get_text())
            for link in quotes.find_all("a"):
                if link['href'][1:7] == "author":
                    print(url + link['href'])
    a += 1
    print("Page:", a)
    nav = soup.find("li", class_="next")
    nextPage = nav.find("a")
    AuthorLink(url + nextPage['href'])
Here's the code that it broke on:
      5 soup = BeautifulSoup(page.content, "html.parser")
      6 divContainer = soup.find("div", class_="container")
----> 7 divRow = divContainer.find_all("div", class_="row")
I don't see why this is happening if it ran for the first two pages successfully.
I've checked the structure of the website, and it seems to change very little from page to page.
I've also tried changing the code so that, instead of using the link from "Next" at the bottom of the page, it just appends the number of the next page to the URL, but this doesn't work either.
CodePudding user response:
You are facing this error because each new request URL is appended onto the previous one, which means the url value over the iterations is (illustrated just after this list):
- "https://quotes.toscrape.com/", which works;
- "https://quotes.toscrape.com/page/2/", which also works;
- "https://quotes.toscrape.com/page/2//page/3/", which the website can't serve, so it doesn't work.
The exact solution could take a different shape, but here is your code with a few small changes:
import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"

def AuthorLink(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    # the second "row" div holds the quotes; the first is the page header
    divRow = divContainer.find_all("div", class_="row")[1]
    divQuotes = divRow.find_all("div", class_="quote")
    for quotes in divQuotes:
        for el in quotes.find_all("small", class_="author"):
            print(el.get_text())
        for link in quotes.find_all("a"):
            if link['href'][1:7] == "author":
                print(base_url + link['href'])

# pages 1 to 4; widen the range to cover more pages
for i in range(1, 5):
    AuthorLink(f"{base_url}/page/{i}")
I have defined a new base_url variable to store the actual website link. The next page is always "/page/[i]", which means we can use a for loop to generate i = 1, 2, 3, ... The other change is print(base_url + link['href']): you had used url instead of base_url, which again leads to the same URL-growing problem described above.