This code scrolls down to the bottom of the page linked in the code and should append the URLs of all the recipe pages listed there. But it only collects around 70/80/90 URLs, seemingly at random, and I don't understand why that is happening.
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import numpy as np
from urllib.parse import urljoin
driver = webdriver.Firefox()
driver.get("https://www.kitchenstories.com/en/categories/vegan-dishes")
time.sleep(2)
scroll_pause_time = 0.1
screen_height = driver.execute_script("return window.screen.height;")
i = 1
screen_height = screen_height/16
# keep scrolling until we have (roughly) reached the bottom of the page
while True:
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
    i += 1
    time.sleep(scroll_pause_time)
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    print(scroll_height)
    print(screen_height)
    if (screen_height) * i > scroll_height*2:
        break

# collect the recipe links from the fully scrolled page
urls = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for parent in soup.find_all('li', class_="cursor-pointer col-span-6 col-start-auto w-full sm:col-span-4 md:col-span-3"):
    base = "https://www.kitchenstories.com/en/categories/vegan-dishes"
    link = parent.a['href']
    url = urljoin(base, link)
    urls.append(url)
print(len(urls))
print(urls)
What I am expecting from this code is a list of all 522 URLs, so that I can then scrape the ingredients from the respective recipe pages on the website.
CodePudding user response:
Technically, it should return 522 (or, when I look at the site, 523). When you scroll, 24 additional recipes pop up each time. When it gets to the end, for whatever reason, one scroll returns only 19 recipes and the last scroll returns 0, so by the end it actually shorts you 24 recipes (no idea why).
However, regardless of that (I still think 499 recipes is better than 70/80/90), you can get the data from the API request:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Algolia search endpoint the site queries to load the recipe cards
url = 'https://hisovj74hs-dsn.algolia.net/1/indexes/ks_articles_recipes/query'
payload = {
    'x-algolia-agent': 'Algolia for JavaScript (4.12.2); Browser (lite)',
    'x-algolia-api-key': 'bbf355ef9a83a85c9e0a02c0fae58c8f',
    'x-algolia-application-id': 'HISOVJ74HS'}

page = 1
data = '{"query":"","responseFields":["hits","nbHits","nbPages","page","nbHits"],"filters":"categories.level1:vegan-dishes","getRankingInfo":false,"attributesToHighlight":[],"attributesToRetrieve":["*","-tags","-categories","-additional_content","-_highlightResult"],"facetFilters":["content_type:recipe","language:en"],"page":%s,"hitsPerPage":24}' %page

jsonData = requests.post(url, params=payload, data=data).json()
totPages = jsonData['nbPages']

# collect the hits from every page of results
recipes = []
for page in range(1, totPages+1):
    print(f'Page: {page} of {totPages}')
    if page == 1:
        print(len(jsonData['hits']))
        recipes = jsonData['hits']
    else:
        data = '{"query":"","responseFields":["hits","nbHits","nbPages","page","nbHits"],"filters":"categories.level1:vegan-dishes","getRankingInfo":false,"attributesToHighlight":[],"attributesToRetrieve":["*","-tags","-categories","-additional_content","-_highlightResult"],"facetFilters":["content_type:recipe","language:en"],"page":%s,"hitsPerPage":24}' %page
        jsonData = requests.post(url, params=payload, data=data).json()
        print(len(jsonData['hits']))
        recipes += jsonData['hits']

df = pd.json_normalize(recipes, meta=['api_content'])
Output:
print(df['api_content.url'])
0 https://www.kitchenstories.com/en/recipes/suns...
1 https://www.kitchenstories.com/en/recipes/hot-...
2 https://www.kitchenstories.com/en/recipes/vale...
3 https://www.kitchenstories.com/en/recipes/quic...
4 https://www.kitchenstories.com/en/recipes/thai...
494 https://www.kitchenstories.com/en/recipes/home...
495 https://www.kitchenstories.com/en/recipes/oven...
496 https://www.kitchenstories.com/en/recipes/appl...
497 https://www.kitchenstories.com/en/recipes/home...
498 https://www.kitchenstories.com/en/recipes/vega...
Name: api_content.url, Length: 499, dtype: object
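If all you need is the list of URLs (as in the question), so that you can go on to scrape the ingredients from each page, you can pull that column straight out of the DataFrame. A minimal sketch based on the DataFrame built above (recipe_urls is just an illustrative variable name):
# list of recipe URLs from the DataFrame above, ready for further scraping
recipe_urls = df['api_content.url'].tolist()
print(len(recipe_urls))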
CodePudding user response:
What happens?
The main problem is that the website does not behave as expected for a user, and therefore not for Selenium either. If you scroll down manually, you will see that the website changes to https://www.kitchenstories.com/en/categories on the last scroll, before you can see the final chunk of recipes.
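You can observe this from the question's own scroll loop as well: checking driver.current_url after each scroll shows the moment the page jumps away from the vegan-dishes category. This is only an illustrative check (assuming the loop variables from the question), not part of the fix below:
# inside the question's while loop, right after the scrollTo call
if driver.current_url != "https://www.kitchenstories.com/en/categories/vegan-dishes":
    print("page navigated away to:", driver.current_url)
    break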
How to deal with this?
@chitown88 has already shown you a very good approach to retrieve well-structured information.
I would therefore like to additionally show a Selenium-supported approach to get 523 out of 523 results.
Because of the scrolling behaviour I mentioned, we need to control the process more tightly, so we first read the total number of results and use that in range() for our for loop. In each iteration, we only scroll the last result into view, so we don't have to deal with the height of the screen or the body:
...
numResults = int(driver.find_element(By.CSS_SELECTOR, 'article>section>p').text.split()[0])

for i in range(1,int(numResults/24)):
    driver.execute_script("arguments[0].scrollIntoView()", driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]')[-1])
    time.sleep(0.5)

soup = BeautifulSoup(driver.page_source)
...
Example
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time

url = 'https://www.kitchenstories.com/en/categories/vegan-dishes'

options = Options()
options.headless = True
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)

# wait until the first chunk of recipe cards is present
wait = WebDriverWait(driver, 5)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[data-test="ks-card"]')))

# read the total number of results, then scroll the last card into view until everything is loaded
numResults = int(driver.find_element(By.CSS_SELECTOR, 'article>section>p').text.split()[0])
for i in range(1,int(numResults/24)):
    driver.execute_script("arguments[0].scrollIntoView()", driver.find_elements(By.CSS_SELECTOR, 'li[data-test="ks-card"]')[-1])
    time.sleep(0.5)

soup = BeautifulSoup(driver.page_source)

# extract title, cooking time, url and image of each recipe card
data = []
for e in soup.select('li[data-test="ks-card"]'):
    data.append({
        'title':e.h3.text,
        'time':t.parent.text if (t := e.time) else None,
        'url':e.a.get('href'),
        'img': i['src'] if(i := e.img) else None
    })

pd.DataFrame(data)
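One small follow-up, since the original goal was a list of full recipe URLs: e.a.get('href') returns whatever is in the card's href attribute, which may be a relative path. If it is, you can join it against the site root the same way the question already does with urljoin; full_url below is just an illustrative column name:
from urllib.parse import urljoin

df = pd.DataFrame(data)
# turn relative hrefs into absolute recipe URLs (urljoin leaves already-absolute links unchanged)
df['full_url'] = df['url'].apply(lambda link: urljoin('https://www.kitchenstories.com', link))
print(df['full_url'].head())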