My question is pretty simple: I am trying to iterate through a list of URLs and scrape the contents of each using Requests and BeautifulSoup. However, it looks as if the for loop is not properly assigning a new URL to requests.get(), and it returns the contents of the first URL regardless of which iteration the loop is currently on. If any of you run this, you'll see that "print(url)" prints the proper URL, but the contents of "taglist" are always the results from URL #1. I'll paste my code down below in case one of you can spot the error(s). Thanks!
import requests
import os
import bs4
import pandas as pd
import numpy as np

urllist = [
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#exchanger',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#pipe',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#surface',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#tank',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#boiler',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#tools',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#swivels',
    'https://www.stoneagetools.com/waterblast-tools-automated-equipment#accessories',
]

def Get_Names(urllist):
    endlist = []
    for url in urllist:
        templist = []
        print(url)
        response = requests.get(url)
        html = response.content
        soup = bs4.BeautifulSoup(html, 'lxml')
        taglist = soup.find_all('h3')
        del taglist[0]  # drop the first <h3>, which is not a product name
        for tag in taglist:
            tag_str = str(tag)
            clean1 = tag_str.replace('<h3>', '')
            clean2 = clean1.replace('</h3>', '')
            templist.append(clean2)
        endlist.append(templist)
    return endlist
CodePudding user response:
For what you want to do, your code doesn't have an error. The webpage you're scraping is identical each time. The # in each link marks a URL fragment, which a browser uses to jump to a section within one page; fragments are never sent to the server, so requests.get() downloads the exact same document for every URL in your list.
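You can verify this yourself with a minimal sketch like the one below (urldefrag is from the standard library and strips the fragment, the same way requests effectively does before sending the request):

import requests
from urllib.parse import urldefrag

urls = [
    "https://www.stoneagetools.com/waterblast-tools-automated-equipment#exchanger",
    "https://www.stoneagetools.com/waterblast-tools-automated-equipment#pipe",
]

for url in urls:
    base, fragment = urldefrag(url)      # split off the "#..." part
    print(base, "| fragment:", fragment)  # same base URL both times

# Both requests fetch the same resource, so the bodies should match
# (barring any dynamic content the server injects).
print(requests.get(urls[0]).content == requests.get(urls[1]).content)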
CodePudding user response:
All products are already on the initial page. To get all products and their sections as a pandas DataFrame, you can use the next example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.stoneagetools.com/waterblast-tools-automated-equipment"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for div in soup.select("section.directory > div"):
    # each product card is a <div>; its section heading is the nearest <h2> above it
    section = div.find_previous("h2").get_text(strip=True)
    name1 = div.h3.get_text(strip=True)
    name2 = div.h5.get_text(strip=True)
    all_data.append([section, name1, name2])

df = pd.DataFrame(all_data, columns=["Section", "Name1", "Name2"])
print(df.head(15).to_markdown(index=False))
Prints:
| Section            | Name1                  | Name2                             |
|--------------------|------------------------|-----------------------------------|
| Exchanger Cleaning | AutoPack 3L Sentinel   | Smart Automated Equipment Kit     |
| Exchanger Cleaning | AutoPack 3L            | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack 2L            | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack Compass       | Automated Equipment Kit           |
| Exchanger Cleaning | AutoPack PRO           | Automated Equipment Kit           |
| Exchanger Cleaning | AutoBox 2L             | Dual flex-lancing system          |
| Exchanger Cleaning | AutoBox 3L             | Triple flex-lancing system        |
| Exchanger Cleaning | ProDrive               | AutoBox ABX-PRO hose feed tractor |
| Exchanger Cleaning | Bundle Blaster         | Shell side exchanger cleaning     |
| Exchanger Cleaning | Compass                | Radial Indexer for ABX-PRO        |
| Exchanger Cleaning | Confined Space Kit     | For Compass Radial Indexer        |
| Exchanger Cleaning | Fin Fan Accessory      | For AutoBox systems               |
| Exchanger Cleaning | Hose Management System | For AutoBox hose tractors         |
| Exchanger Cleaning | Lightweight Positioner | For AutoBox systems               |
| Exchanger Cleaning | Rigid Lance Machine    | For exchanger tubes               |
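If you still want one list of product names per section, like the nested lists your Get_Names() was building, you can regroup the DataFrame. A small sketch, assuming the df built above:

# Collect the product names (Name1) into one list per section,
# similar to the nested lists Get_Names() was trying to produce.
names_by_section = df.groupby("Section")["Name1"].apply(list).to_dict()

for section, names in names_by_section.items():
    print(f"{section}: {len(names)} products")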