I'm looking to scrape 'https://quotes.toscrape.com/' and retrieve, for each quote, the author's full name, date of birth, and location of birth. There are 10 pages of quotes. To retrieve an author's date of birth and location of birth, one must follow the '(about)' link next to the author's name.
Functionally speaking, I need to scrape 10 pages of quotes, follow each quote author's 'about' link to retrieve the data mentioned above, and then compile this data into a list or dict, without duplicates.
I can complete some of these tasks separately, but I am new to BeautifulSoup and Python and am having trouble putting them all together. So far I can retrieve the author info from the quotes on page 1, but I am unable to properly assign the function's results to a variable (without an erroneous in-function print statement), and unable to implement the 10-page scan... Any help is greatly appreciated.
```python
import requests
from bs4 import BeautifulSoup

def get_author_dob(url):
    response_auth = requests.get(url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth, 'html.parser')
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return [auth_tag.text]

def get_author_bplace(url):
    response_auth2 = requests.get(url)
    html_auth2 = response_auth2.content
    auth_soup2 = BeautifulSoup(html_auth2, 'html.parser')
    auth_tag2 = auth_soup2.find("span", class_="author-born-location")
    return [auth_tag2.text]

url = 'http://quotes.toscrape.com/'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find_all("div", class_="quote")

def auth_retrieval(url):
    for t in tag:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
```
I need to use 'return' in the above function to be able to assign the results to a variable, but when I do, it only returns one value. I have tried the generator route with yield, but I'm confused about how to implement a counter when I am already iterating over tag. I'm also confused about where and how to insert the 10-page scan task. Thanks in advance.
CodePudding user response:
Your code is almost working and just needs a bit of refactoring.
One thing I found out is that you can access the individual pages using this URL pattern:

```
https://quotes.toscrape.com/page/{page_number}/
```
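(A quick one-off check of this pattern, if you want to verify it yourself before relying on it:)

```python
import requests

# Any page number from 1 to 10 should respond with HTTP 200
print(requests.get('https://quotes.toscrape.com/page/2/').status_code)  # expect 200
```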
Once you've figured that out, we can take advantage of this pattern in the code:
```python
# refactored auth_retrieval into this one for reusability
def get_page_data(base_url, tags):
    all_authors = []
    for t in tags:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = base_url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
        all_authors.append(authorss)
    return all_authors
```
```python
url = 'https://quotes.toscrape.com/'  # base url for the website
total_pages = 10

all_page_authors = []
for i in range(1, total_pages + 1):  # range is exclusive at the top, so total_pages + 1 reaches page 10
    page_url = f'{url}page/{i}/'  # https://quotes.toscrape.com/page/1/, 2, ... 10
    print(page_url)
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("div", class_="quote")
    all_page_authors += get_page_data(url, tags)  # merge each page's authors into one list
print(all_page_authors)
```
`get_author_dob` and `get_author_bplace` remain the same.
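As an aside, `get_author_dob` and `get_author_bplace` each download the same author page, so every quote costs two identical requests. A possible refinement is to fetch the page once and pull both fields from the same soup (the combined `get_author_details` helper below is my own naming, not part of the original answer):

```python
def get_author_details(url):
    # Hypothetical combined helper: one request instead of two
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    dob = soup.find("span", class_="author-born-date").text
    b_place = soup.find("span", class_="author-born-location").text
    return [dob, b_place]
```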
The final output will be a list of authors, where each author's info is itself a list.
```
[['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
 ['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
 ['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'], ...]
```
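Note that this sample output still contains duplicates (Albert Einstein appears twice), while the question asked for a duplicate-free result. A minimal sketch for deduplicating a list of lists like `all_page_authors`, assuming the three fields together identify an author:

```python
# Deduplicate while preserving order; tuples are hashable, lists are not
seen = set()
unique_authors = []
for a in all_page_authors:
    key = tuple(a)
    if key not in seen:
        seen.add(key)
        unique_authors.append(a)
print(unique_authors)
```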
CodePudding user response:
You are on the right track, but you could simplify the process a bit:

- Use a while loop and check whether the `next` button is available to perform the paging. This also works if the number of pages is not known in advance, and you could still stop after a specific number of pages if needed.
- Reduce the number of requests and scrape all available and necessary information in one go. Picking up a bit more than you need is not bad, since you can filter it down easily to reach your goal: `df[['author','dob','lob']].drop_duplicates()`
- Store information in a structured way, like a `dict`, instead of in single variables.

Example:
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_author(url):
    # One request per author page; both fields come from the same soup
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    author = {
        'dob': soup.select_one('.author-born-date').text,
        'lob': soup.select_one('.author-born-location').text,
        'url': url
    }
    return author

base_url = 'http://quotes.toscrape.com'
url = base_url

quotes = []
while True:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for e in soup.select('div.quote'):
        quote = {
            'author': e.select_one('small.author').text,
            'quote': e.select_one('span.text').text
        }
        quote.update(get_author(base_url + e.a.get('href')))
        quotes.append(quote)
    if soup.select_one('li.next a'):
        # Follow the "next" button until it disappears on the last page
        url = base_url + soup.select_one('li.next a').get('href')
        print(url)
    else:
        break

pd.DataFrame(quotes)
```
Output:

| | author | quote | dob | lob | url |
|---|---|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | March 14, 1879 | in Ulm, Germany | http://quotes.toscrape.com/author/Albert-Einstein |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | July 31, 1965 | in Yate, South Gloucestershire, England, The United Kingdom | http://quotes.toscrape.com/author/J-K-Rowling |
| ... | ... | ... | ... | ... | ... |
| 98 | Dr. Seuss | “A person's a person, no matter how small.” | March 02, 1904 | in Springfield, MA, The United States | http://quotes.toscrape.com/author/Dr-Seuss |
| 99 | George R.R. Martin | “... a mind needs books as a sword needs a whetstone, if it is to keep its edge.” | September 20, 1948 | in Bayonne, New Jersey, The United States | http://quotes.toscrape.com/author/George-R-R-Martin |
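To get the duplicate-free author list the question asked for, apply the filter from the first bullet above to the resulting DataFrame:

```python
df = pd.DataFrame(quotes)
# Keep one row per author; 'author', 'dob' and 'lob' together identify an author
authors = df[['author', 'dob', 'lob']].drop_duplicates()
print(authors)
```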