Problem scraping title, price and date from a site using selenium

Time:11-10

Hi guys, I'm trying to scrape some information about a shoe from Zalando and save the price, the title, the day and the hour in different variables using the Selenium webdriver. This is my code:

from selenium import webdriver
from selenium.webdriver.common.by import By
import csv

DRIVER_PATH = r'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')

# Get the data of product 1 (if I change the number in /div/div[1]/div, it gets the data of another shoe)

product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')

element_text = product_1.text


print(element_text)

When I print element_text with the code above, I get all of the product's information in one string, which I want to save in different variables:

109,95 € Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordármelo

Once this small script worked, I tried to split the data by adding the code below, so that I could save each type of data in its own variable, but I ran into a problem:

from selenium import webdriver
from selenium.webdriver.common.by import By
import csv

DRIVER_PATH = r'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')

#Select product 1 

product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')

element_text = product_1.text

# Split the data on whitespace
element_text_split = element_text.split()

# Price 1 --> Result=109,95
price_1 = element_text_split[0]
print(price_1)

# Title 1 --> Result=€ (not the title I expected)
title_1 = element_text_split[1]
print(title_1)

The results of these two prints are "109,95" and "€".

I thought element_text_split[1] would be Nike Sportswear, but it is the € sign, because I'm splitting the data on the spaces between the words.

This is a big problem if I want to get the title of the shoe, because the names don't all have the same number of words, for example: Nike Dunk Low Cz or Air Jordan One Mid 1

How can I solve this problem? Thanks!

CodePudding user response:

I think you could be searching for something like this?

# Needed libs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# We create the driver
DRIVER_PATH = r'C:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)

# We maximize the window
driver.maximize_window()

# We navigate to the url
url='https://www.zalando.es/release-calendar/zapatillas-mujer/'
driver.get(url)

# We save a list of elements that are products (search for that xpath in the page and you will see what kind of element it is)
products = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='release-calendar']//div[contains(@data-cid,'cid')]")))

# We loop over that list and, for each product, take the price, the brand, the model and the date.
for i, product in enumerate(products):
    price = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[2]"))).text
    brand = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[3]"))).text
    model = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[4]"))).text
    date = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[5]"))).text
    url = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']//a"))).get_attribute("href")
    image = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']//img"))).get_attribute("src")
    print(f"""{price}
{brand}
{model}
{date}
{url}
{image}
""")

CodePudding user response:

One idea is to look at the variable element_text for many different products and decide on a different way to split the text: the split method can take a separator string to split the longer string by.
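As a minimal sketch of splitting on a separator string, using the sample line from the question: str.split accepts both a separator and a maximum number of splits, so splitting once on " € " separates the price from everything else. The sample string is copied from the question; the text returned by the live page may be laid out differently (for example with newlines), so treat the separator as an assumption to check.

```python
# Sample text copied from the question; the live page may format it differently.
element_text = (
    "109,95 € Nike Sportswear WMNS DUNK LOW CZ "
    "10 de noviembre de 2022, 8:15 Recordármelo"
)

# Split only once, on the euro sign together with its surrounding spaces.
price, rest = element_text.split(" € ", 1)

print(price)  # 109,95
print(rest)   # Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordármelo
```

The title and the date are still joined in rest, because no fixed separator sits between them.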

If that doesn't work, you can also iterate through the element_text_split variable (which is just a list of strings) and break up that list by looking for certain smaller strings or by using regex.

For example, to find the prices, you could look for digits, a decimal separator (a comma on this site), then digits again. I'm assuming the name of the product comes right before or after the price. Good luck!
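A rough sketch of the regex idea, again using the sample line from the question. The pattern assumes the layout "price € title day de month de year, hour", which is inferred from that single sample and may not hold for every product; the group names are chosen here for illustration.

```python
import re

# Sample text copied from the question; the live page may format it differently.
element_text = (
    "109,95 € Nike Sportswear WMNS DUNK LOW CZ "
    "10 de noviembre de 2022, 8:15 Recordármelo"
)

# Assumed layout: price, euro sign, title, Spanish long date, comma, hour.
pattern = re.compile(
    r"(?P<price>\d+(?:\.\d{3})*,\d{2})\s*€\s+"   # e.g. 109,95 or 1.099,95
    r"(?P<title>.+?)\s+"                          # shortest text before the date
    r"(?P<date>\d{1,2} de \w+ de \d{4}),\s*"      # e.g. 10 de noviembre de 2022
    r"(?P<hour>\d{1,2}:\d{2})"                    # e.g. 8:15
)

match = pattern.search(element_text)
fields = match.groupdict() if match else {}

print(fields.get("price"))  # 109,95
print(fields.get("title"))  # Nike Sportswear WMNS DUNK LOW CZ
print(fields.get("date"))   # 10 de noviembre de 2022
print(fields.get("hour"))   # 8:15
```

Because the title group is non-greedy and must be followed by a full date, digits inside a name such as "WMNS AIR 1 MID" are not mistaken for the start of the date.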

CodePudding user response:

You can grab the required data in a more powerful way by using Selenium together with bs4:

from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import pandas as pd
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

d = []
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
driver.maximize_window()
time.sleep(5)

soup = BeautifulSoup(driver.page_source,"html.parser")
price = [x.get_text(strip=True) for x in soup.select('.Wqd6Qu + div')]
#print(price)
title = [x.get_text(strip=True) for x in soup.select('.Wqd6Qu + div + div + div')]
#print(title)

date = [x.get_text(strip=True).split(',')[0] for x in soup.select('.Wqd6Qu + div + div + div + div')]
#print(date)


hour = [x.get_text(strip=True).split(',')[1] for x in soup.select('.Wqd6Qu + div + div + div + div')]
#print(hour)


cols = ['title', 'price', 'date', 'hour']
  
df = pd.DataFrame(data=list(zip(title,price,date,hour)), columns=cols)
print(df)

Output:

                   title     price             date           hour
0       WMNS DUNK LOW CZ  109,95 €  10 de noviembre de 2022   14:15
1    HYPERTURF ADVENTURE  139,95 €  11 de noviembre de 2022   14:00
2       W AIR MAX 95 ESS  189,95 €  11 de noviembre de 2022   14:00
3           CITY CLASSIC  119,95 €  11 de noviembre de 2022   14:00
4           CITY CLASSIC  119,95 €  11 de noviembre de 2022   14:00
5         WMNS AIR 1 MID  129,95 €  11 de noviembre de 2022   14:15
6   DUNK LOW NEXT NATURE  109,95 €  11 de noviembre de 2022   14:15
7            CROSS WOMEN  295,00 €  14 de noviembre de 2022   14:00
8            CROSS WOMEN  295,00 €  14 de noviembre de 2022   14:00
9            CROSS WOMEN  295,00 €  14 de noviembre de 2022   14:00
10           W DUNK HIGH  119,95 €  14 de noviembre de 2022   14:15
11                 MT410   99,95 €  16 de noviembre de 2022   14:00
12                 MT410   99,95 €  16 de noviembre de 2022   14:00
13                 MT410   99,95 €  16 de noviembre de 2022   14:00
14                 MT410   99,95 €  16 de noviembre de 2022   14:00
15                 MT410   94,95 €  16 de noviembre de 2022   14:00
16                 WL574  109,95 €  18 de noviembre de 2022   14:00
17                 WS327  119,95 €  18 de noviembre de 2022   14:00
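The columns above are still plain strings. If numeric prices or real datetime values are needed, one possible sketch follows. The helper names parse_price and parse_release and the SPANISH_MONTHS table are hypothetical and not part of any answer; the formats they expect are assumed from the output shown above.

```python
from datetime import datetime

# Hypothetical lookup table for Spanish month names.
SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

def parse_price(text):
    """Convert a string such as '109,95 €' to a float (assumed format)."""
    number = text.replace("€", "").strip()
    # Drop thousands separators, then switch the decimal comma to a point.
    return float(number.replace(".", "").replace(",", "."))

def parse_release(date_text, hour_text):
    """Combine '10 de noviembre de 2022' and '14:15' into a datetime."""
    day, month_name, year = date_text.strip().split(" de ")
    hours, minutes = hour_text.strip().split(":")
    return datetime(int(year), SPANISH_MONTHS[month_name.lower()],
                    int(day), int(hours), int(minutes))

print(parse_price("109,95 €"))                               # 109.95
print(parse_release("10 de noviembre de 2022", " 14:15"))    # 2022-11-10 14:15:00
```

With the DataFrame from the answer, these could be applied per column, for example df["price"].map(parse_price).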