How to iterate through a list of URLs for BeautifulSoup Web Scraping?


My question is pretty simple: I am trying to iterate through a list of URLs and scrape the contents of each using Requests and BeautifulSoup. However, it looks as if the for loop is not properly assigning a new URL to requests.get(), and it returns the contents of the first URL regardless of which iteration the loop is on. If any of you run this, you'll see that print(url) prints the proper URL, but the contents of taglist are always the results from URL #1. I'll paste my code down below in case one of you can spot the error(s). Thanks!

import requests
import bs4

urllist = ['https://www.stoneagetools.com/waterblast-tools-automated-equipment#exchanger',
           'https://www.stoneagetools.com/waterblast-tools-automated-equipment#pipe',
           'https://www.stoneagetools.com/waterblast-tools-automated-equipment#surface',
           'https://www.stoneagetools.com/waterblast-tools-automated-equipment#tank',
           'https://www.stoneagetools.com/waterblast-tools-automated-equipment#boiler',
           'https://www.stoneagetools.com/waterblast-tools-automated-equipment#tools',
           'https://www.stoneagetools.com/waterblast-tools-automated-equipment#swivels',
           'https://www.stoneagetools.com/waterblast-tools-automated-equipment#accessories']

def Get_Names(urllist):

    endlist = []

    for url in urllist:

        templist = []

        print(url)  # prints the expected URL on every iteration

        # fetch and parse the page
        response = requests.get(url)
        html = response.content
        soup = bs4.BeautifulSoup(html, 'lxml')

        # grab every <h3> on the page
        taglist = soup.find_all('h3')
        del taglist[0]  # discard the first <h3>

        # strip the <h3></h3> wrapper, keeping only the text
        for tag in taglist:
            tag_str = str(tag)
            clean1 = tag_str.replace('<h3>', '')
            clean2 = clean1.replace('</h3>', '')
            templist.append(clean2)

        endlist.append(templist)

    return endlist
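For reference, a minimal call (not in the original post) that reproduces the symptom described above:

names = Get_Names(urllist)
# symptom: every inner list comes back identical to the first
print(all(n == names[0] for n in names))  # True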

CodePudding user response:

For what you want to do, your code doesn't have an error. The webpage you're scraping is identical each time: everything after the # in each link is a URL fragment, which a browser uses to jump to a section of the page and which is never sent to the server. All eight URLs therefore fetch the same document, so requests returns the same content on every iteration.
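A quick way to convince yourself (a minimal sketch, not part of the original answer): split the fragment off with urllib.parse.urldefrag and compare a few response bodies.

import requests
from urllib.parse import urldefrag

base = "https://www.stoneagetools.com/waterblast-tools-automated-equipment"

# the part before '#' is all that is actually sent to the server
print(urldefrag(base + "#exchanger").url)

# fetching several fragment URLs should yield identical bodies
# (barring any dynamic content on the page)
pages = [requests.get(base + frag).content for frag in ("#exchanger", "#pipe", "#tank")]
print(all(page == pages[0] for page in pages))  # expected: True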

CodePudding user response:

All products are already on the initial page. To get all products and their sections as a pandas DataFrame, you can use the next example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.stoneagetools.com/waterblast-tools-automated-equipment"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for div in soup.select("section.directory > div"):
    # the nearest preceding <h2> is the section heading the product sits under
    section = div.find_previous("h2").get_text(strip=True)
    name1 = div.h3.get_text(strip=True)  # product name
    name2 = div.h5.get_text(strip=True)  # short description
    all_data.append([section, name1, name2])

df = pd.DataFrame(all_data, columns=["Section", "Name1", "Name2"])
print(df.head(15).to_markdown(index=False))

Prints:

| Section            | Name1                  | Name2                             |
|:-------------------|:-----------------------|:----------------------------------|
| Exchanger Cleaning | AutoPack 3L Sentinel   | Smart Automated Equipment Kit     |
| Exchanger Cleaning | AutoPack 3L            | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack 2L            | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack Compass       | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack PRO           | Automated Equipment Kit           |
| Exchanger Cleaning | AutoBox 2L             | Dual flex-lancing system          |
| Exchanger Cleaning | AutoBox 3L             | Triple flex-lancing system        |
| Exchanger Cleaning | ProDrive               | AutoBox ABX-PRO hose feed tractor |
| Exchanger Cleaning | Bundle Blaster         | Shell side exchanger cleaning     |
| Exchanger Cleaning | Compass                | Radial Indexer for ABX-PRO        |
| Exchanger Cleaning | Confined Space Kit     | For Compass Radial Indexer        |
| Exchanger Cleaning | Fin Fan                | Accessory For AutoBox systems     |
| Exchanger Cleaning | Hose Management System | For AutoBox hose tractors         |
| Exchanger Cleaning | Lightweight Positioner | For AutoBox systems               |
| Exchanger Cleaning | Rigid Lance Machine    | For exchanger tubes               |
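
If you only need the products from one section (what the # fragments were meant to select), you can filter the DataFrame; the section label below is taken from the table above:

# products in the "Exchanger Cleaning" section only
exchanger = df[df["Section"] == "Exchanger Cleaning"]
print(exchanger["Name1"].tolist())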