My question is pretty simple: I am trying to iterate through a list of URLs and scrape the contents of each using Requests and BeautifulSoup. However, it looks as if the for loop is not properly assigning a new URL to requests.get(), and it returns the contents of the first URL regardless of which iteration the loop is currently on. If any of you run this, you'll see that "print(url)" prints the proper URL, but the contents of "taglist" are always the results from URL #1. I'll paste my code down below in case one of you can spot the error(s). Thanks!
import requests
import os
import bs4
import pandas as pd
import numpy as np

urllist = [
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#exchanger',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#pipe',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#surface',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#tank',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#boiler',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#tools',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#swivels',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#accessories',
]

def Get_Names(urllist):
    endlist = []
    for url in urllist:
        templist = []
        print(url)
        response = requests.get(url)
        html = response.content
        soup = bs4.BeautifulSoup(html, 'lxml')
        taglist = soup.find_all('h3')
        del taglist[0]  # drop the first <h3>, which is not a product name
        for tag in taglist:
            tag_str = str(tag)
            clean1 = tag_str.replace('<h3>', '')
            clean2 = clean1.replace('</h3>', '')
            templist.append(clean2)
        endlist.append(templist)
    return endlist
CodePudding user response:
For what you want to do, your code doesn't have an error. The webpage you're scraping is identical each time. The # in each link marks a URL fragment, which a browser uses to jump to a section within one page; fragments are never sent to the server, so requests.get() downloads the exact same document for every URL in your list.
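You can verify this yourself with a minimal sketch like the one below (urldefrag is from the standard library and strips the fragment, the same way requests effectively does before sending the request):

import requests
from urllib.parse import urldefrag

urls = [
    "https://www.stoneagetools.com/waterblast-tools-automated-equipment#exchanger",
    "https://www.stoneagetools.com/waterblast-tools-automated-equipment#pipe",
]

for url in urls:
    base, fragment = urldefrag(url)      # split off the "#..." part
    print(base, "| fragment:", fragment)  # same base URL both times

# Both requests fetch the same resource, so the bodies should match
# (barring any dynamic content the server injects).
print(requests.get(urls[0]).content == requests.get(urls[1]).content)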
CodePudding user response:
All products are already on the initial page. To get all products and their sections as a pandas DataFrame, you can use the next example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.stoneagetools.com/waterblast-tools-automated-equipment"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for div in soup.select("section.directory > div"):
    # each product card is a <div>; its section heading is the nearest <h2> above it
    section = div.find_previous("h2").get_text(strip=True)
    name1 = div.h3.get_text(strip=True)
    name2 = div.h5.get_text(strip=True)
    all_data.append([section, name1, name2])

df = pd.DataFrame(all_data, columns=["Section", "Name1", "Name2"])
print(df.head(15).to_markdown(index=False))
Prints:
| Section            | Name1                  | Name2                             |
|--------------------|------------------------|-----------------------------------|
| Exchanger Cleaning | AutoPack 3L Sentinel   | Smart Automated Equipment Kit     |
| Exchanger Cleaning | AutoPack 3L            | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack 2L            | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack Compass       | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack PRO           | Automated Equipment Kit           |
| Exchanger Cleaning | AutoBox 2L             | Dual flex-lancing system          |
| Exchanger Cleaning | AutoBox 3L             | Triple flex-lancing system        |
| Exchanger Cleaning | ProDrive               | AutoBox ABX-PRO hose feed tractor |
| Exchanger Cleaning | Bundle Blaster         | Shell side exchanger cleaning     |
| Exchanger Cleaning | Compass                | Radial Indexer for ABX-PRO        |
| Exchanger Cleaning | Confined Space Kit     | For Compass Radial Indexer        |
| Exchanger Cleaning | Fin Fan Accessory      | For AutoBox systems               |
| Exchanger Cleaning | Hose Management System | For AutoBox hose tractors         |
| Exchanger Cleaning | Lightweight Positioner | For AutoBox systems               |
| Exchanger Cleaning | Rigid Lance Machine    | For exchanger tubes               |
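If you still want one list of product names per section, like the nested lists your Get_Names() was building, you can regroup the DataFrame. A small sketch, assuming the df built above:

# Collect the product names (Name1) into one list per section,
# similar to the nested lists Get_Names() was trying to produce.
names_by_section = df.groupby("Section")["Name1"].apply(list).to_dict()

for section, names in names_by_section.items():
    print(f"{section}: {len(names)} products")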