Hi guys, I'm trying to scrape some information about a shoe from Zalando and save the price, the title, the day and the hour in different variables using Selenium WebDriver. This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv
DRIVER_PATH = r'C:\chromedriver.exe'  # raw string so the backslash is not read as an escape
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
# Get the data of product 1 (if I change the index in /div/div[1]/div to another number, I get the data of another shoe)
product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')
element_text = product_1.text
print(element_text)
When I print element_text from the code above, I get a lot of information about the product. I want to save this in different variables, so I tried one thing (keep reading):
109,95 € Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordármelo
This little piece of code works, so I then added some code to split the data and save each type of data in its own variable, but I ran into a problem:
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv
DRIVER_PATH = r'C:\chromedriver.exe'  # raw string so the backslash is not read as an escape
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
#Select product 1
product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')
element_text = product_1.text
#Split the data
element_text_split = element_text.split()
# Price 1 --> Result=109,95
price_1 = element_text_split[0]
print(price_1)
# Title 1 --> Result=€
title_1 = element_text_split[1]
print(title_1)
The results of these two prints are: "109,95" and "€".
I thought element_text_split[1] would be Nike Sportswear, but no, it's the € sign, because I'm splitting the data on the spaces between the words.
This is a big problem if I want to get the title of the shoe, because the names don't all contain the same number of spaces, e.g. Nike Dunk Low Cz or Air Jordan One Mid 1.
How can I solve this problem? Thanks!
CodePudding user response:
I think you may be looking for something like this:
# Needed libs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# We create the driver
DRIVER_PATH = r'C:\chromedriver.exe'  # raw string so the backslash is not read as an escape
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
# We maximize the window
driver.maximize_window()
# We navigate to the url
url='https://www.zalando.es/release-calendar/zapatillas-mujer/'
driver.get(url)
# We save a list of elements that are products (search for that xpath in the page and you will see what kind of element it is)
products = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='release-calendar']//div[contains(@data-cid,'cid')]")))
# We loop over that list and, for each product, take the price, the brand, the model, the date, the URL and the image.
for i, product in enumerate(products):
    # data-cid values start at 1, while enumerate starts at 0, hence i + 1
    price = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[2]"))).text
    brand = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[3]"))).text
    model = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[4]"))).text
    date = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']/div[5]"))).text
    url = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']//a"))).get_attribute("href")
    image = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i + 1}']//img"))).get_attribute("src")
    print(f"""{price}
{brand}
{model}
{date}
{url}
{image}
""")
CodePudding user response:
One idea is to look at the element_text variable for many different products and decide on a different way to split the text: the split method accepts a separator string to split the longer string on.
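As a rough sketch of that idea, here is how the sample line from the question could be split on explicit separators instead of on every space. It assumes every product follows the same layout as that one sample (price, " € ", title, a five-word Spanish date, ", ", hour), which I have not checked against other products:

```python
# Sample text copied from the question; other products may be laid out differently.
element_text = "109,95 € Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordármelo"

# Split once on the currency sign: everything before it is the price.
price, rest = element_text.split(" € ", 1)

# Split once on the comma that separates the date from the hour.
title_and_date, after_comma = rest.split(", ", 1)

# The hour is the first whitespace-separated token after the comma.
hour = after_comma.split()[0]

# Assumption: the date is always five words long ("10 de noviembre de 2022"),
# so whatever comes before those five words is the title.
tokens = title_and_date.split()
title = " ".join(tokens[:-5])
date = " ".join(tokens[-5:])

print(price)
print(title)
print(date)
print(hour)
```

The second argument to split (maxsplit=1) matters here: it stops after the first separator, so a title that happened to contain a comma after the date would not be split again.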
If that doesn't work, you can also iterate through the element_text_split variable (which is just a list of strings) and break up that list by looking for certain substrings or by using a regex.
For example, to find the prices, you could look for digits, a decimal separator (a comma in the sample from the question), then digits again. I'm assuming the name of the product comes either before or after the price. Good luck!
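A minimal sketch of the regex approach, based only on the one sample line in the question. The names PRODUCT_RE and parse_product are made up for illustration, and the pattern assumes a price with a decimal comma, followed by the title, followed by a date of the form "10 de noviembre de 2022, 8:15":

```python
import re

# Pattern inferred from a single sample line, so treat it as an assumption:
# "<price> € <title> <day> de <month> de <year>, <hour> ..."
PRODUCT_RE = re.compile(
    r"(?P<price>\d+,\d{2})\s*€\s+"             # digits, decimal comma, two digits
    r"(?P<title>.+?)\s+"                        # anything up to the date (lazy match)
    r"(?P<date>\d{1,2} de \w+ de \d{4}),\s*"    # e.g. "10 de noviembre de 2022"
    r"(?P<hour>\d{1,2}:\d{2})",                 # e.g. "8:15"
    re.DOTALL,                                  # let the title span line breaks
)


def parse_product(text):
    """Return a dict with price, title, date and hour, or None if the text does not match."""
    match = PRODUCT_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Collapse any line breaks or repeated spaces inside the title.
    fields["title"] = " ".join(fields["title"].split())
    return fields


sample = "109,95 € Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordármelo"
parsed = parse_product(sample)
print(parsed)
```

Because the title is matched lazily up to the first thing that looks like a date, it does not matter how many words the shoe name has. A product whose text does not fit the pattern returns None instead of raising, so it can be skipped or logged.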
CodePudding user response:
You can grab the required data in a powerful way by combining Selenium with bs4 (BeautifulSoup):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import pandas as pd
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
d = []
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
driver.maximize_window()
time.sleep(5)
soup = BeautifulSoup(driver.page_source,"html.parser")
price= [x.get_text(strip=True) for x in soup.select('.Wqd6Qu div')]
#print(price)
title= [x.get_text(strip=True) for x in soup.select('.Wqd6Qu div div div')]
#print(title)
date = [x.get_text(strip=True).split(',')[0] for x in soup.select('.Wqd6Qu div div div div')]
#print(date)
hour = [x.get_text(strip=True).split(',')[1] for x in soup.select('.Wqd6Qu div div div div')]
#print(hour)
cols = ['title', 'price', 'date', 'hour']
df = pd.DataFrame(data=list(zip(title,price,date,hour)), columns=cols)
print(df)
Output:
                   title     price                     date   hour
 0      WMNS DUNK LOW CZ  109,95 €  10 de noviembre de 2022  14:15
 1   HYPERTURF ADVENTURE  139,95 €  11 de noviembre de 2022  14:00
 2      W AIR MAX 95 ESS  189,95 €  11 de noviembre de 2022  14:00
 3          CITY CLASSIC  119,95 €  11 de noviembre de 2022  14:00
 4          CITY CLASSIC  119,95 €  11 de noviembre de 2022  14:00
 5        WMNS AIR 1 MID  129,95 €  11 de noviembre de 2022  14:15
 6  DUNK LOW NEXT NATURE  109,95 €  11 de noviembre de 2022  14:15
 7           CROSS WOMEN  295,00 €  14 de noviembre de 2022  14:00
 8           CROSS WOMEN  295,00 €  14 de noviembre de 2022  14:00
 9           CROSS WOMEN  295,00 €  14 de noviembre de 2022  14:00
10           W DUNK HIGH  119,95 €  14 de noviembre de 2022  14:15
11                 MT410   99,95 €  16 de noviembre de 2022  14:00
12                 MT410   99,95 €  16 de noviembre de 2022  14:00
13                 MT410   99,95 €  16 de noviembre de 2022  14:00
14                 MT410   99,95 €  16 de noviembre de 2022  14:00
15                 MT410   94,95 €  16 de noviembre de 2022  14:00
16                 WL574  109,95 €  18 de noviembre de 2022  14:00
17                 WS327  119,95 €  18 de noviembre de 2022  14:00