I am using BeautifulSoup and Selenium to extract web data (BeautifulSoup to parse the HTML page and Selenium to click Next to get to the next list of items on the page).
What I need the code to do is:
- Get the current URL and retrieve the information I am looking to scrape
- Click Next to go to the next page within the same URL
- Retrieve the information from page 2
- Click Next to go to page 3, and so on
What my current code is doing is:
- Getting the current URL and retrieving the information I am looking to scrape correctly
- Clicking Next to go to the next page correctly (I can see this happening in headless mode)
- Still retrieving the information from page 1
- Clicking Next to go to page 3 correctly
I assume this is because some of the steps in my code are in the wrong sequence. An abridged version is below. Can you see what I'm doing wrong?
import requests
from bs4 import BeautifulSoup
from csv import writer
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By

URL = "https://www.theitdepot.com/products-Motherboards_C13.html"
wd = webdriver.Chrome(ChromeDriverManager().install())
wd.get(URL)

running = True
while running:
    page = requests.get(URL, verify=False)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="filter_display")
    item_elements = results.find_all("div", class_="product-details text-md-left flex-grow-1")
    with open('data.csv', 'a', encoding='utf8', newline='') as f:
        thewriter = writer(f)
        for item_element in item_elements:
            # code to retrieve information and write to CSV here
            name_element = item_element.find("div", class_="card-text px-2 py-1 font-size85 product_title")
            name = str(name_element.text)
            print(name)
    next = wd.find_element(by=By.XPATH, value="//*[contains(text(), 'Next →')]")
    wd.execute_script("arguments[0].click();", next)
    time.sleep(10)  # prevent ban
    f.close()
(Note: I know this is currently an infinite loop; I intend to add the logic to know when all the pages are done.)
CodePudding user response:
For this easy task you can use Selenium itself instead of BeautifulSoup. The reason your script keeps returning page 1 data is that requests.get(URL) re-downloads the original URL on every pass through the loop; the Next clicks happen only inside the Selenium-driven browser, which requests never sees. Moreover, you can save the product names in a list and export it using numpy. I prefer numpy here because it lets you replace the whole with open(...) as f: block with a single line.
import time
import numpy
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.theitdepot.com/products-Motherboards_C13.html")

number_of_pages_to_scrape = 5
names = []
for i in range(number_of_pages_to_scrape):
    # read the product titles from the page the browser is currently on
    items = driver.find_elements(By.CSS_SELECTOR, "div[class='card-text px-2 py-1 font-size85 product_title']")
    for item in items:
        print(item.text)
        names.append(item.text)
    # advance to the next page before the next iteration scrapes it
    driver.find_element(By.XPATH, "//*[contains(text(), 'Next')]").click()
    time.sleep(10)

numpy.savetxt("data.csv", names, fmt='%s')
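As for knowing when all the pages are done: a minimal sketch, assuming the Next link simply disappears on the last page (I haven't verified this site's behaviour), is to click until Selenium can no longer find the link and catch the resulting exception:

import time
import numpy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.theitdepot.com/products-Motherboards_C13.html")

names = []
while True:
    # scrape whatever page the browser is on, then try to advance
    for item in driver.find_elements(By.CSS_SELECTOR, "div[class='card-text px-2 py-1 font-size85 product_title']"):
        names.append(item.text)
    try:
        driver.find_element(By.XPATH, "//*[contains(text(), 'Next')]").click()
    except NoSuchElementException:
        break  # assumption: no Next link means this was the last page
    time.sleep(10)  # stay polite between page loads

numpy.savetxt("data.csv", names, fmt='%s')

And if you would rather keep BeautifulSoup for the parsing, the one-line fix to your original script is to parse the page the browser is actually on, i.e. BeautifulSoup(wd.page_source, "html.parser"), instead of re-downloading the start URL with requests.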