I'm trying to scrape the Conforama website, and to do so I'm using BeautifulSoup. I want to retrieve the price, description, rating, URL and number of reviews of each item, iterating over 3 pages of results.
First, I import the required libraries:
import csv
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
I define a first function, get_url, that correctly formats the URL for a specific search_term and returns a URL that is still waiting to be formatted with the right page number:
def get_url(search_term):
    template = 'https://www.conforama.fr/recherche-conforama/{}'
    # encode spaces in the search term for the URL
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    # append the page placeholder, to be filled in later with .format(page)
    url += '?P1-PRODUCTS[page]={}'
    return url
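To make the two-step formatting concrete, here is what the function produces for a hypothetical two-word search term (assuming the reconstructed concatenation above):

url = get_url('lit coffre')
# 'https://www.conforama.fr/recherche-conforama/lit+coffre?P1-PRODUCTS[page]={}'
print(url.format(2))
# https://www.conforama.fr/recherche-conforama/lit+coffre?P1-PRODUCTS[page]=2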
I define a second one to strip out the characters that make the numeric data unreadable:
def format_number(number):
    new_number = ''
    for n in number:
        # stop at the first character that is not part of a price/number
        if n not in '0123456789€,.':
            return new_number
        new_number += n
    return new_number
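For example, on a hypothetical price string it keeps everything up to the first character that is neither a digit, a euro sign, a comma nor a dot:

print(format_number('1299,99 € dont eco-participation'))
# '1299,99'  (stops at the first space)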
I define a third function that takes a record and extracts all the information I need from it: its price, description, URL, rating and number of reviews:
def extract_record(item):
    # description: concatenate the two description links of the item
    descriptions = item.find_all("a", {"class": "bindEvent"})
    description = descriptions[1].text.strip() + ' ' + descriptions[2].text.strip()
    # get url of product
    url = descriptions[2]['href']
    # number of reviews
    nor = descriptions[3].text.strip()
    nor = format_number(nor)
    # rating (find_all returns a list, so a missing element raises
    # IndexError/KeyError rather than AttributeError)
    try:
        ratings = item.find_all("span", {"class": "stars"})
        rating = ratings[0]['data']
    except (IndexError, KeyError):
        return
    # price
    try:
        prices = item.find_all("div", {"class": "price-product"})
        price = prices[0].text.strip()
    except IndexError:
        return
    price = format_number(price)
    return (description, price, rating, nor, url)
In the end, I gather all the functions inside a main function that iterates over all the pages I need to extract from:
def main(search_term):
    driver = webdriver.Chrome(ChromeDriverManager().install())
    records = []
    url = get_url(search_term)
    somme = 0
    for page in range(1, 4):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        print('longueur soup', len(soup))
        results = soup.find_all('li', {'class': 'ais-Hits-item box-product fragItem'})
        print(len(results))
        somme += len(results)
        for result in results:
            record = extract_record(result)
            if record:
                print(record)
                records.append(record)
    driver.close()
    print('somme', somme)
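The csv and pandas imports are there because I eventually want to dump the records to a file; a minimal sketch of that last step (not related to my problem, and assuming the tuple order returned by extract_record) would be:

df = pd.DataFrame(records, columns=['description', 'price', 'rating', 'reviews', 'url'])
df.to_csv('conforama_items.csv', index=False)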
Now the problem is that when I run all the commands one by one:
driver = webdriver.Chrome(ChromeDriverManager().install())
url = get_url('couch').format(1)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
item = results[0]
extracted = extract_record(item)
everything is great and the extract_record function returns exactly what I need. However, when I run the main function, this line of code:
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
does not return any results, even though I know it does when I execute it outside of the main function.
Has anyone had the same problem? Do you have any idea what I'm doing wrong and how to fix it? Thanks a lot for reading and trying to answer.
CodePudding user response:
What happens?
The main issue is that the elements need some time to be generated/displayed, and they are not yet available at the moment you grab the driver.page_source.
How to fix?
Use Selenium's waits until the presence of a specific element is located:
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
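Waiting for the nested div.price-product rather than just the li means the wait only succeeds once a product tile has actually rendered its price, which is exactly what your extract_record needs later.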
Example
...
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
...
def main(search_term):
    driver = webdriver.Chrome(ChromeDriverManager().install())
    records = []
    url = get_url(search_term)
    somme = 0
    for page in range(1, 4):
        driver.get(url.format(page))
        print(url.format(page))
        # wait until at least one product tile has rendered its price
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('li', {'class': 'ais-Hits-item box-product fragItem'})
        somme += len(results)
        for result in results:
            record = extract_record(result)
            if record:
                print(record)
                records.append(record)
    driver.close()
    print('somme', somme)
main('matelas')
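Note: if one of the pages never renders any product, the wait raises a TimeoutException after 10 seconds. A minimal guard inside the loop (assuming you simply want to skip such a page) could look like this:

from selenium.common.exceptions import TimeoutException

try:
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
except TimeoutException:
    continue  # nothing rendered within 10 seconds; skip to the next page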