Beautiful soup and selenium scrolling issue


This code scrolls to the bottom of the page at the link given in the code and should append the URLs of all the recipe pages listed there. But it only collects somewhere between 70 and 90 URLs, seemingly at random, and I don't understand why that is happening.

import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import numpy as np
from urllib.parse import urljoin
driver = webdriver.Firefox()
driver.get("https://www.kitchenstories.com/en/categories/vegan-dishes")
time.sleep(2) 
scroll_pause_time = 0.1 
screen_height = driver.execute_script("return window.screen.height;") 
i = 1
screen_height = screen_height/16
while True:
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))  
    i += 1
    time.sleep(scroll_pause_time)
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    print(scroll_height)
    print(screen_height)
    if (screen_height) * i > scroll_height*2:
        break
urls = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for parent in soup.find_all('li', class_="cursor-pointer col-span-6 col-start-auto w-full sm:col-span-4 md:col-span-3"):
    base = "https://www.kitchenstories.com/en/categories/vegan-dishes"
    link = parent.a['href']  # first <a> inside the card holds the recipe link
    url = urljoin(base, link)  # resolves relative hrefs against the category page
    urls.append(url)
print(len(urls))
print(urls)

What I expect from this code is a list of all 522 URLs, so that I can then scrape the ingredients from each of those recipe pages.
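
For context, this is roughly what I plan to do with the URL list afterwards (the selector in get_ingredients is only a placeholder; I still have to inspect the recipe pages for the real elements):

import requests
from bs4 import BeautifulSoup

def get_ingredients(recipe_url):
    html = requests.get(recipe_url).text
    soup = BeautifulSoup(html, "html.parser")
    # placeholder selector -- I still need to look up the real ingredient elements on the page
    return [el.get_text(strip=True) for el in soup.select("li")]

# for url in urls:
#     print(url, get_ingredients(url))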

CodePudding user response:

Technically, it should return 522 (when I look at the site it actually says 523). Each scroll loads 24 additional recipes, so the full set is 21 batches of 24 plus a final 19 (21 * 24 + 19 = 523). When it gets to the end, for whatever reason, one load returns only those 19 recipes and the last scroll returns 0, so by the end it shorts you exactly one batch of 24 (no idea why): 523 - 24 = 499.
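
If you want to watch this happen, you can count the rendered cards after each scroll of the question's loop; the recipe cards can be selected via their data-test="ks-card" attribute. A small helper, for example:

from selenium.webdriver.common.by import By

def loaded_recipe_count(driver):
    # number of recipe cards currently rendered on the page
    return len(driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]'))

# inside the question's while loop, after each scroll:
#     print(loaded_recipe_count(driver))   # grows by 24 per scroll until the tail end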

However, regardless of that (I still think 499 recipes is better than 70/80/90), you can get the data from the API request instead:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://hisovj74hs-dsn.algolia.net/1/indexes/ks_articles_recipes/query'
payload = {
    'x-algolia-agent': 'Algolia for JavaScript (4.12.2); Browser (lite)',
    'x-algolia-api-key': 'bbf355ef9a83a85c9e0a02c0fae58c8f',
    'x-algolia-application-id': 'HISOVJ74HS'}


page = 1
data = '{"query":"","responseFields":["hits","nbHits","nbPages","page","nbHits"],"filters":"categories.level1:vegan-dishes","getRankingInfo":false,"attributesToHighlight":[],"attributesToRetrieve":["*","-tags","-categories","-additional_content","-_highlightResult"],"facetFilters":["content_type:recipe","language:en"],"page":%s,"hitsPerPage":24}' %page


jsonData = requests.post(url, params=payload, data=data).json()
totPages = jsonData['nbPages']

recipes = []
for page in range(1, totPages + 1):
    print(f'Page: {page} of {totPages}')
    if page == 1:
        print(len(jsonData['hits']))
        recipes += jsonData['hits']
    else:
        data = '{"query":"","responseFields":["hits","nbHits","nbPages","page","nbHits"],"filters":"categories.level1:vegan-dishes","getRankingInfo":false,"attributesToHighlight":[],"attributesToRetrieve":["*","-tags","-categories","-additional_content","-_highlightResult"],"facetFilters":["content_type:recipe","language:en"],"page":%s,"hitsPerPage":24}' %page 
        jsonData = requests.post(url, params=payload, data=data).json()
        print(len(jsonData['hits']))
        recipes += jsonData['hits']
    
df = pd.json_normalize(recipes, meta=['api_content'])

Output:

print(df['api_content.url'])
0      https://www.kitchenstories.com/en/recipes/suns...
1      https://www.kitchenstories.com/en/recipes/hot-...
2      https://www.kitchenstories.com/en/recipes/vale...
3      https://www.kitchenstories.com/en/recipes/quic...
4      https://www.kitchenstories.com/en/recipes/thai...
                              ...
494    https://www.kitchenstories.com/en/recipes/home...
495    https://www.kitchenstories.com/en/recipes/oven...
496    https://www.kitchenstories.com/en/recipes/appl...
497    https://www.kitchenstories.com/en/recipes/home...
498    https://www.kitchenstories.com/en/recipes/vega...
Name: api_content.url, Length: 499, dtype: object
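
If all you need for the follow-up scraping is a flat list of the URLs, that is just this column converted to a Python list:

urls = df['api_content.url'].tolist()
print(len(urls))  # 499, matching the Length shown above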

CodePudding user response:

What happens?

The main problem is that the website itself does not behave as expected for a user, and therefore not for Selenium either. If you scroll down manually, you will see that the page switches to https://www.kitchenstories.com/en/categories on the last scroll, just before you would get to see the final chunk of recipes.

How to deal with this?

@chitown88 has already shown you a very good approach to retrieve well-structured information.

I would therefore like to add a Selenium-based approach that gets 523 out of 523 results.

Because of the scrolling behaviour mentioned above, we need to control the process more tightly: we first read the total number of results and use it in range() for our for loop. In each iteration we only scroll the last result card into view, so we do not have to deal with the height of the screen or the body at all:

...
numResults = int(driver.find_element(By.CSS_SELECTOR, 'article>section>p').text.split()[0])

for i in range(1,int(numResults/24)):

    driver.execute_script("arguments[0].scrollIntoView()", driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]')[-1])
    time.sleep(0.5)
    driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]')

soup = BeautifulSoup(driver.page_source, 'html.parser')
...
Example
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time

url = 'https://www.kitchenstories.com/en/categories/vegan-dishes'

options = Options()
options.headless = True
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 5)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[data-test="ks-card"]')))

numResults = int(driver.find_element(By.CSS_SELECTOR, 'article>section>p').text.split()[0])

for i in range(1,int(numResults/24)):

    driver.execute_script("arguments[0].scrollIntoView()", driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]')[-1])
    time.sleep(0.5)

soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []

for e in soup.select('li[data-test="ks-card"]'):
    data.append({
        'title':e.h3.text,
        'time':t.parent.text if (t := e.time) else None,
        'url':e.a.get('href'),
        'img': i['src'] if(i := e.img) else None
    })

pd.DataFrame(data)
Output
title time url img
0 One-pot creamy chickpea and spinach curry 30 min. https://www.kitchenstories.com/en/recipes/one-pot-creamy-chickpea-and-spinach-curry https://images.kitchenstories.io/wagtailOriginalImages/R2550-photo-final-3/R2550-photo-final-3-small-portrait-150.jpg
1 Vegan mushroom goulash with dumplings 60 min. https://www.kitchenstories.com/en/recipes/vegan-mushroom-goulash-with-dumplings https://images.kitchenstories.io/wagtailOriginalImages/R2472-photo-final-2/R2472-photo-final-2-small-portrait-150.jpg
2 Creamy coconut, pumpkin, and lentil stew 45 min. https://www.kitchenstories.com/en/recipes/creamy-coconut-pumpkin-and-lentil-stew https://images.kitchenstories.io/wagtailOriginalImages/R2549-photo-final-1/R2549-photo-final-1-small-portrait-150.jpg
3 TikTok's viral vegan green goddess salad 30 min. https://www.kitchenstories.com/en/recipes/tiktok-s-vegan-green-goddess-salad https://images.kitchenstories.io/wagtailOriginalImages/R2617-photo-final-1/R2617-photo-final-1-small-portrait-150.jpg
4 Veggie burrito bowl 30 min. https://www.kitchenstories.com/en/recipes/veggie-burrito-bowl https://images.kitchenstories.io/wagtailOriginalImages/R2453-photo-title-1/R2453-photo-title-1-small-portrait-150.jpg
... ... ... ... ...
518 Homemade applesauce 40 min. https://www.kitchenstories.com/en/recipes/homemade-applesauce https://images.kitchenstories.io/recipeImages/H290-photo-final-4x3/H290-photo-final-4x3-small-portrait-150.jpg
519 Oven-roasted rosemary potatoes 40 min. https://www.kitchenstories.com/en/recipes/oven-roasted-rosemary-potatoes https://images.kitchenstories.io/recipeImages/00_091_OvenRoastedRosemaryPotatoes_4x3/00_091_OvenRoastedRosemaryPotatoes_4x3-small-portrait-150.jpg
520 Apple pear compote 30 min. https://www.kitchenstories.com/en/recipes/apple-pear-compote https://images.kitchenstories.io/recipeImages/00_115_ApplePearCompote_4x3/00_115_ApplePearCompote_4x3-small-portrait-150.jpg
521 Homemade pumpkin purée 60 min. https://www.kitchenstories.com/en/recipes/homemade-pumpkin-puree https://images.kitchenstories.io/recipeImages/H244-photo-final-4x3/H244-photo-final-4x3-small-portrait-150.jpg
522 Vegan sponge cake base 30 min. https://www.kitchenstories.com/en/recipes/vegan-shortcake https://images.kitchenstories.io/recipeImages/00_256_HowToMakeVeganSpongecake/00_256_HowToMakeVeganSpongecake-small-portrait-150.jpg
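
One note on the url field: the hrefs come back absolute here (see the output above), but if they ever show up relative, urljoin from the question's own imports normalises them safely, since absolute URLs pass through it unchanged:

from urllib.parse import urljoin

base = 'https://www.kitchenstories.com/en/categories/vegan-dishes'
urls = [urljoin(base, d['url']) for d in data]  # absolute hrefs pass through urljoin unchanged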