My function is only returning first element of the list when called. I am using BeautifulSoup to ext

Time:06-14

A Python beginner here. I am using BeautifulSoup to scrape the details (title, quantity in stock) of all books on the first page of books.toscrape.com. To do that, I first need the links to all the individual books, and I wrote the function page1_url for this. The problem is that when the function returns the list of extracted links, it contains only the first element. Please help me identify the error, or suggest alternative code that uses only BeautifulSoup. Thanks in advance!

import requests
from bs4 import BeautifulSoup


def page1_url(page1):
    response= requests.get(page1)
    data= BeautifulSoup(response.text,'html.parser')
   
    
    b1= data.find_all('h3')
    
    for i in b1:
        l=i.find_all('a')
        for j in l:
            l1=j['href']
            books_urls=[]
            books_urls.append(base_url + l1)
            books_urls=list(books_urls)
            return books_urls
            
    
                     

allPages = ['http://books.toscrape.com/catalogue/page-1.html',
            'http://books.toscrape.com/catalogue/page-2.html']

base_url= 'http://books.toscrape.com/catalogue/'
bookURLs= page1_url(allPages[0])
print(bookURLs) 

Answer:

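You are re-creating the books_urls list for every link, and the return statement sits inside the for j in l loop, so the function exits after the first link is appended. Create the list once before the loops and return it only after both loops have finished: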

import requests
from bs4 import BeautifulSoup


def page1_url(page1):
    response= requests.get(page1)
    data= BeautifulSoup(response.text,'html.parser')
   
    b1= data.find_all('h3')
    
    # you were rewriting this list for each link
    books_urls = []

    for i in b1:
        l=i.find_all('a')
        for j in l:
            l1=j['href']
            books_urls.append(base_url + l1)

    # these lines had too many indents
    books_urls=list(books_urls)
    return books_urls
            
    
allPages = ['http://books.toscrape.com/catalogue/page-1.html',
            'http://books.toscrape.com/catalogue/page-2.html']

base_url= 'http://books.toscrape.com/catalogue/'
bookURLs= page1_url(allPages[0])
print(bookURLs)

Output:

['http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', 'http://books.toscrape.com/catalogue/soumission_998/index.html', 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html', ... 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html']
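To see the effect of the misplaced return on its own, here is a minimal sketch that leaves out the network request and the HTML parsing. The names first_only, collect_all and sample_hrefs are made up for this illustration and are not in the original code. The two hrefs are assumed to look like the relative links on the catalogue page. urljoin from the standard library stands in for the string concatenation, which is a swap on my part and not required for the fix.

```python
from urllib.parse import urljoin

BASE_URL = 'http://books.toscrape.com/catalogue/'


def first_only(hrefs):
    # Mirrors the bug: the list is re-created and returned inside the loop,
    # so the function exits during the first iteration.
    for href in hrefs:
        urls = []
        urls.append(urljoin(BASE_URL, href))
        return urls


def collect_all(hrefs):
    # Create the list once, fill it in the loop, return after the loop ends.
    urls = []
    for href in hrefs:
        urls.append(urljoin(BASE_URL, href))
    return urls


# Hypothetical sample data, assumed to match the relative hrefs on the page.
sample_hrefs = ['a-light-in-the-attic_1000/index.html',
                'tipping-the-velvet_999/index.html']

print(first_only(sample_hrefs))   # a list with one URL
print(collect_all(sample_hrefs))  # a list with both URLs
```

Note that first_only also returns None when the input is empty, because the loop body never runs. collect_all returns an empty list in that case, which is usually easier for the caller to handle.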