I'm trying to scrape the Conforama website, and to do so I'm using BeautifulSoup. I want to retrieve the price, description, rating, URL and number of reviews of each item, iterating over 3 pages of results.
First, I import the required libraries:
import csv
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
I define a first function, get_url, that correctly formats the URL for a specific search_term and returns a URL that is still waiting to be formatted with the right page number:
def get_url(search_term):
    template = 'https://www.conforama.fr/recherche-conforama/{}'
    # encode spaces in the search term for the URL
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    # append the page placeholder, to be filled in later with .format(page)
    url += '?P1-PRODUCTS[page]={}'
    return url
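To make the two-step formatting concrete, here is what the function produces for a hypothetical two-word search term (assuming the reconstructed concatenation above):

url = get_url('lit coffre')
# 'https://www.conforama.fr/recherche-conforama/lit+coffre?P1-PRODUCTS[page]={}'
print(url.format(2))
# https://www.conforama.fr/recherche-conforama/lit+coffre?P1-PRODUCTS[page]=2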
I define a second one to strip out the characters that make the numeric data unreadable:
def format_number(number):
    new_number = ''
    for n in number:
        # stop at the first character that is not part of a price/number
        if n not in '0123456789€,.':
            return new_number
        new_number += n
    return new_number
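For example, on a hypothetical price string it keeps everything up to the first character that is neither a digit, a euro sign, a comma nor a dot:

print(format_number('1299,99 € dont eco-participation'))
# '1299,99'  (stops at the first space)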
I define a third function that takes a record and extracts all the information I need from it: its price, description, URL, rating and number of reviews:
def extract_record(item):
    # description: concatenate the two description links of the item
    descriptions = item.find_all("a", {"class": "bindEvent"})
    description = descriptions[1].text.strip() + ' ' + descriptions[2].text.strip()
    # get url of product
    url = descriptions[2]['href']
    # number of reviews
    nor = descriptions[3].text.strip()
    nor = format_number(nor)
    # rating (find_all returns a list, so a missing element raises
    # IndexError/KeyError rather than AttributeError)
    try:
        ratings = item.find_all("span", {"class": "stars"})
        rating = ratings[0]['data']
    except (IndexError, KeyError):
        return
    # price
    try:
        prices = item.find_all("div", {"class": "price-product"})
        price = prices[0].text.strip()
    except IndexError:
        return
    price = format_number(price)
    return (description, price, rating, nor, url)
In the end, I gather all the functions inside a main function that iterates over all the pages I need to extract from:
def main(search_term):
    driver = webdriver.Chrome(ChromeDriverManager().install())
    records = []
    url = get_url(search_term)
    somme = 0
    for page in range(1, 4):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        print('longueur soup', len(soup))
        results = soup.find_all('li', {'class': 'ais-Hits-item box-product fragItem'})
        print(len(results))
        somme += len(results)
        for result in results:
            record = extract_record(result)
            if record:
                print(record)
                records.append(record)
    driver.close()
    print('somme', somme)
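The csv and pandas imports are there because I eventually want to dump the records to a file; a minimal sketch of that last step (not related to my problem, and assuming the tuple order returned by extract_record) would be:

df = pd.DataFrame(records, columns=['description', 'price', 'rating', 'reviews', 'url'])
df.to_csv('conforama_items.csv', index=False)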
Now the problem is that when I run all the commands one by one:
driver = webdriver.Chrome(ChromeDriverManager().install())
url = get_url('couch').format(1)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
item = results[0]
extracted = extract_record(item)
everything is great and the extract_record function returns exactly what I need. However, when I run the main function, this line of code:
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
does not return any results, even though I know it does when I execute it outside of the main function.
Has anyone had the same problem? Do you have any idea what I'm doing wrong and how to fix it? Thanks a lot for reading and trying to answer.
CodePudding user response:
What happens?
The main issue is that the elements need some time to be generated/displayed, and they are not yet available at the moment you grab the driver.page_source.
How to fix?
Use Selenium's waits until the presence of a specific element is located:
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all('li', {'class' : 'ais-Hits-item box-product fragItem'})
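Waiting for the nested div.price-product rather than just the li means the wait only succeeds once a product tile has actually rendered its price, which is exactly what your extract_record needs later.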
Example
...
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
...
def main(search_term):
    driver = webdriver.Chrome(ChromeDriverManager().install())
    records = []
    url = get_url(search_term)
    somme = 0
    for page in range(1, 4):
        driver.get(url.format(page))
        print(url.format(page))
        # wait until at least one product tile has rendered its price
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('li', {'class': 'ais-Hits-item box-product fragItem'})
        somme += len(results)
        for result in results:
            record = extract_record(result)
            if record:
                print(record)
                records.append(record)
    driver.close()
    print('somme', somme)
main('matelas')
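Note: if one of the pages never renders any product, the wait raises a TimeoutException after 10 seconds. A minimal guard inside the loop (assuming you simply want to skip such a page) could look like this:

from selenium.common.exceptions import TimeoutException

try:
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.ais-Hits-item.box-product.fragItem div.price-product')))
except TimeoutException:
    continue  # nothing rendered within 10 seconds; skip to the next page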