This code scrolls down to the bottom of the page linked in the code and should append the URLs of all the recipe pages listed there. But it only collects around 70/80/90 URLs, seemingly at random, and I don't understand why that is happening.
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import numpy as np
from urllib.parse import urljoin
driver = webdriver.Firefox()
driver.get("https://www.kitchenstories.com/en/categories/vegan-dishes")
time.sleep(2)
scroll_pause_time = 0.1
screen_height = driver.execute_script("return window.screen.height;")
i = 1
screen_height = screen_height/16
# keep scrolling until we have (roughly) reached the bottom of the page
while True:
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
    i += 1
    time.sleep(scroll_pause_time)
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    print(scroll_height)
    print(screen_height)
    if (screen_height) * i > scroll_height*2:
        break

# collect the recipe links from the fully scrolled page
urls = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for parent in soup.find_all('li', class_="cursor-pointer col-span-6 col-start-auto w-full sm:col-span-4 md:col-span-3"):
    base = "https://www.kitchenstories.com/en/categories/vegan-dishes"
    link = parent.a['href']
    url = urljoin(base, link)
    urls.append(url)
print(len(urls))
print(urls)
What I am expecting from this code is a list of all 522 URLs, so that I can then scrape the ingredients from the respective recipe pages on the website.
CodePudding user response:
Technically, it should return 522 (or, when I look at the site, 523). When you scroll, 24 additional recipes pop up each time. When it gets to the end, for whatever reason, one scroll returns only 19 recipes and the last scroll returns 0, so by the end it actually shorts you 24 recipes (no idea why).
However, regardless of that (I still think 499 recipes is better than 70/80/90), you can get the data from the API request:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Algolia search endpoint the site queries to load the recipe cards
url = 'https://hisovj74hs-dsn.algolia.net/1/indexes/ks_articles_recipes/query'
payload = {
    'x-algolia-agent': 'Algolia for JavaScript (4.12.2); Browser (lite)',
    'x-algolia-api-key': 'bbf355ef9a83a85c9e0a02c0fae58c8f',
    'x-algolia-application-id': 'HISOVJ74HS'}

page = 1
data = '{"query":"","responseFields":["hits","nbHits","nbPages","page","nbHits"],"filters":"categories.level1:vegan-dishes","getRankingInfo":false,"attributesToHighlight":[],"attributesToRetrieve":["*","-tags","-categories","-additional_content","-_highlightResult"],"facetFilters":["content_type:recipe","language:en"],"page":%s,"hitsPerPage":24}' %page

jsonData = requests.post(url, params=payload, data=data).json()
totPages = jsonData['nbPages']

# collect the hits from every page of results
recipes = []
for page in range(1, totPages+1):
    print(f'Page: {page} of {totPages}')
    if page == 1:
        print(len(jsonData['hits']))
        recipes = jsonData['hits']
    else:
        data = '{"query":"","responseFields":["hits","nbHits","nbPages","page","nbHits"],"filters":"categories.level1:vegan-dishes","getRankingInfo":false,"attributesToHighlight":[],"attributesToRetrieve":["*","-tags","-categories","-additional_content","-_highlightResult"],"facetFilters":["content_type:recipe","language:en"],"page":%s,"hitsPerPage":24}' %page
        jsonData = requests.post(url, params=payload, data=data).json()
        print(len(jsonData['hits']))
        recipes += jsonData['hits']

df = pd.json_normalize(recipes, meta=['api_content'])
Output:
print(df['api_content.url'])
0 https://www.kitchenstories.com/en/recipes/suns...
1 https://www.kitchenstories.com/en/recipes/hot-...
2 https://www.kitchenstories.com/en/recipes/vale...
3 https://www.kitchenstories.com/en/recipes/quic...
4 https://www.kitchenstories.com/en/recipes/thai...
494 https://www.kitchenstories.com/en/recipes/home...
495 https://www.kitchenstories.com/en/recipes/oven...
496 https://www.kitchenstories.com/en/recipes/appl...
497 https://www.kitchenstories.com/en/recipes/home...
498 https://www.kitchenstories.com/en/recipes/vega...
Name: api_content.url, Length: 499, dtype: object
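If all you need is the list of URLs (as in the question), so that you can go on to scrape the ingredients from each page, you can pull that column straight out of the DataFrame. A minimal sketch based on the DataFrame built above (recipe_urls is just an illustrative variable name):
# list of recipe URLs from the DataFrame above, ready for further scraping
recipe_urls = df['api_content.url'].tolist()
print(len(recipe_urls))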
CodePudding user response:
What happens?
The main problem is that the website does not behave as expected for a user, and therefore not for Selenium either. If you scroll down manually, you will see that the website changes to https://www.kitchenstories.com/en/categories on the last scroll, before you can see the final chunk of recipes.
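You can observe this from the question's own scroll loop as well: checking driver.current_url after each scroll shows the moment the page jumps away from the vegan-dishes category. This is only an illustrative check (assuming the loop variables from the question), not part of the fix below:
# inside the question's while loop, right after the scrollTo call
if driver.current_url != "https://www.kitchenstories.com/en/categories/vegan-dishes":
    print("page navigated away to:", driver.current_url)
    break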
How to deal with this?
@chitown88 has already shown you a very good approach to retrieve well-structured information.
I would therefore like to additionally show a Selenium-supported approach to get 523 out of 523 results.
Because of the scrolling behaviour I mentioned, we need to control the process more tightly, so we first read the total number of results and use that in range() for our for loop. In each iteration, we only scroll the last result into view, so we don't have to deal with the height of the screen or the body:
...
numResults = int(driver.find_element(By.CSS_SELECTOR, 'article>section>p').text.split()[0])

for i in range(1,int(numResults/24)):
    driver.execute_script("arguments[0].scrollIntoView()", driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]')[-1])
    time.sleep(0.5)

soup = BeautifulSoup(driver.page_source)
...
Example
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time

url = 'https://www.kitchenstories.com/en/categories/vegan-dishes'

options = Options()
options.headless = True
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)

# wait until the first chunk of recipe cards is present
wait = WebDriverWait(driver, 5)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[data-test="ks-card"]')))

# read the total number of results, then scroll the last card into view until everything is loaded
numResults = int(driver.find_element(By.CSS_SELECTOR, 'article>section>p').text.split()[0])
for i in range(1,int(numResults/24)):
    driver.execute_script("arguments[0].scrollIntoView()", driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]')[-1])
    time.sleep(0.5)

soup = BeautifulSoup(driver.page_source)

# extract title, cooking time, url and image of each recipe card
data = []
for e in soup.select('li[data-test="ks-card"]'):
    data.append({
        'title':e.h3.text,
        'time':t.parent.text if (t := e.time) else None,
        'url':e.a.get('href'),
        'img': i['src'] if(i := e.img) else None
    })

pd.DataFrame(data)
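One small follow-up, since the original goal was a list of full recipe URLs: e.a.get('href') returns whatever is in the card's href attribute, which may be a relative path. If it is, you can join it against the site root the same way the question already does with urljoin; full_url below is just an illustrative column name:
from urllib.parse import urljoin

df = pd.DataFrame(data)
# turn relative hrefs into absolute recipe URLs (urljoin leaves already-absolute links unchanged)
df['full_url'] = df['url'].apply(lambda link: urljoin('https://www.kitchenstories.com', link))
print(df['full_url'].head())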