How to perform paging to scrape quotes over several pages?

Time:12-09

I'm looking to scrape the website 'https://quotes.toscrape.com/' and retrieve, for each quote, the author's full name, date of birth, and place of birth. There are 10 pages of quotes. To retrieve the author's date and place of birth, one must follow the '(about)' link (an <a href> element) next to the author's name.

Functionally speaking, I need to scrape 10 pages of quotes, follow each quote author's 'about' link to retrieve the data mentioned above, and then compile this data into a list or dict, without duplicates.

I can complete some of these tasks separately, but I am new to BeautifulSoup and Python and am having trouble implementing them all together. So far I can retrieve the author's info for the quotes on page 1, but I cannot properly assign the function's return value to a variable (I only get the full output through a print statement inside the function), and I cannot implement the 10-page scan. Any help is greatly appreciated.

def get_author_dob(url):
    response_auth = requests.get(url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth)
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return [auth_tag.text]

def get_author_bplace(url):
    response_auth2 = requests.get(url)
    html_auth2 = response_auth2.content
    auth_soup2 = BeautifulSoup(html_auth2)
    auth_tag2 = auth_soup2.find("span", class_="author-born-location")
    return [auth_tag2.text]

url = 'http://quotes.toscrape.com/'
soup = BeautifulSoup(html)
tag = soup.find_all("div", class_="quote")
def auth_retrieval (url):
    for t in tag:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print (authorss)

I need to use 'return' in the above function to be able to assign the results to a variable, but when I do, it only returns one value. I have tried the generator route with yield but am confused on how to implement the counter when I am already iterating over tag. Also confused with where and how to insert 10-page scan task. Thanks in advance

CodePudding user response:

Your code is almost working and just needs a bit of refactoring.

One thing I found is that individual pages can be accessed with this URL pattern:

https://quotes.toscrape.com/page/{page_number}/

Knowing that, we can take advantage of the pattern in the code:

#refactored the auth_retrieval to this one for reusability
def get_page_data(base_url, tags):
    all_authors = []
    for t in tags:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = base_url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
        all_authors.append(authorss)
    return all_authors

url = 'https://quotes.toscrape.com/' #base url for the website
total_pages = 10

all_page_authors = []

for i in range(1, total_pages + 1): #range() excludes its end value, so add 1 to reach page 10
    page_url = f'{url}page/{i}/' #https://quotes.toscrape.com/page/1, 2, ... 10
    print(page_url)
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content,'html.parser')
    tags = soup.find_all("div", class_="quote")
    all_page_authors += get_page_data(url, tags) #merge all authors into one list

print(all_page_authors)

get_author_dob and get_author_bplace remain the same.

The final output will be a list of authors, where each author's info is itself a list.

[['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
 ['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
 ['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],...]
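
The list above still contains repeated authors (Albert Einstein appears twice), while the question asks for a result without duplicates. A minimal sketch of one way to handle that, assuming rows shaped like the output above; `unique_rows` is a hypothetical helper name, not part of the original answer. It is written as a generator, so it also shows how `yield` can be used inside a `for` loop without any extra counter:

```python
def unique_rows(rows):
    """Yield each row once, keeping first-seen order."""
    seen = set()
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            yield row


# Sample rows copied from the output shown above
all_page_authors = [
    ['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
    ['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
    ['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
]

# list() drains the generator into a regular list
unique_authors = list(unique_rows(all_page_authors))
print(unique_authors)
```

Wrapping the generator in `list()` is what lets the result be assigned to a variable in one go.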

CodePudding user response:

You are on the right track, but you could simplify the process a bit:

  • Use a while-loop and check whether a next button is available to perform paging. This also works if the number of pages is not known. You could still stop after a specific number of pages if needed.

  • Reduce the number of requests and scrape all available and necessary information in one go. Picking up a bit more than needed does no harm; you can easily filter it afterwards to reach your goal: df[['author','dob','lob']].drop_duplicates()

  • Store information in a structured way, like a dict, instead of in single variables.
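
The optional page limit mentioned in the first bullet can be sketched independently of any HTTP library. This is only an illustration under stated assumptions: `crawl_pages`, `fetch_page`, `find_next_url` and `max_pages` are hypothetical names, and the "site" is a plain dict standing in for real pages.

```python
def crawl_pages(start_url, fetch_page, find_next_url, max_pages=None):
    """Yield (url, page) pairs, following "next" links until there is
    no next link or max_pages pages have been visited."""
    url = start_url
    visited = 0
    while url is not None:
        if max_pages is not None and visited >= max_pages:
            break
        page = fetch_page(url)
        visited += 1
        yield url, page
        url = find_next_url(page)  # None means there is no next button


# Stand-in site: each "page" is a dict holding the URL of the next page (or None)
fake_site = {
    '/page/1/': {'next': '/page/2/'},
    '/page/2/': {'next': '/page/3/'},
    '/page/3/': {'next': None},
}

def next_of(page):
    return page['next']

visited_all = [url for url, _ in crawl_pages('/page/1/', fake_site.get, next_of)]
visited_capped = [url for url, _ in crawl_pages('/page/1/', fake_site.get, next_of, max_pages=2)]
print(visited_all)
print(visited_capped)
```

With real pages, `fetch_page` would download and parse the URL, and `find_next_url` would look for the `li.next a` element and return `None` when it is missing.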

Example

import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_author(url):

    soup = BeautifulSoup(requests.get(url).text, 'html.parser')  # name the parser explicitly
    author = {
        'dob': soup.select_one('.author-born-date').text,
        'lob': soup.select_one('.author-born-location').text,
        'url': url
    }
    return author


base_url = 'http://quotes.toscrape.com'
url = base_url

quotes = []

while True:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')  # name the parser explicitly
    for e in soup.select('div.quote'):
        qoute = {
            'author':e.select_one('small.author').text,
            'qoute':e.select_one('span.text').text
        }
        qoute.update(get_author(base_url + e.a.get('href')))
        quotes.append(qoute)

    if soup.select_one('li.next a'):
        url = base_url + soup.select_one('li.next a').get('href')
        print(url)
    else:
        break
pd.DataFrame(quotes)

Output

    | author | qoute | dob | lob | url
0   | Albert Einstein | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | March 14, 1879 | in Ulm, Germany | http://quotes.toscrape.com/author/Albert-Einstein
1   | J.K. Rowling | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | July 31, 1965 | in Yate, South Gloucestershire, England, The United Kingdom | http://quotes.toscrape.com/author/J-K-Rowling
... | ... | ... | ... | ... | ...
98  | Dr. Seuss | “A person's a person, no matter how small.” | March 02, 1904 | in Springfield, MA, The United States | http://quotes.toscrape.com/author/Dr-Seuss
99  | George R.R. Martin | “... a mind needs books as a sword needs a whetstone, if it is to keep its edge.” | September 20, 1948 | in Bayonne, New Jersey, The United States | http://quotes.toscrape.com/author/George-R-R-Martin
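
The example above calls `get_author` once per quote, so an author with several quotes is requested several times. One possible way to cut those requests down is to remember results by URL. This is a sketch, not part of the original answer: `make_cached_lookup` is a hypothetical helper, and `fake_get_author` stands in for the real `get_author` so the snippet runs without network access.

```python
def make_cached_lookup(fetch_author):
    """Wrap fetch_author so that each URL is fetched at most once."""
    cache = {}

    def lookup(url):
        if url not in cache:
            cache[url] = fetch_author(url)
        return cache[url]

    return lookup


# Stand-in for get_author that records every call it receives
calls = []

def fake_get_author(url):
    calls.append(url)
    return {'dob': 'March 14, 1879', 'lob': 'in Ulm, Germany', 'url': url}


cached_get_author = make_cached_lookup(fake_get_author)
first = cached_get_author('http://quotes.toscrape.com/author/Albert-Einstein')
second = cached_get_author('http://quotes.toscrape.com/author/Albert-Einstein')
print(len(calls))  # number of real fetches performed
```

In the scraping loop, the wrapped function would be built once from `get_author` and then called in its place. The standard library's `functools.lru_cache` decorator is an alternative for the same purpose.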