Scraping a website that has a "Load more" button doesn't return info of newly loaded

Time: 06-11

So I'm using Selenium to press the "Load more" button, and everything loads properly. Then I want to get the info for all of the loaded products, but I only get the info of the first 36 items that appear before the first "Load more" click.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import json
import time
import requests
allinfo = []

chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

url = "https://zadaa.co/de-en/products/women/clothes-dresses/"
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
driver.get(url)

# note: this HTML is fetched once here, before any "Load more" clicks
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

wait = WebDriverWait(driver, 10)
closebutton = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div[5]/button')))
closebutton.click()

# click "Load more" nine times
for x in range(9):
    button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-products")))
    button.click()
content = soup.find_all("a", class_="product-list-item")
for properties in content:
    brand = properties.find("p", class_="product-list-item-title").text
    info = {
        "name": brand,
    }
    allinfo.append(info)

df = pd.DataFrame(allinfo)
print(df.head())
df.to_csv("zadaa.csv")

This is the web page I'm trying to scrape: https://zadaa.co/de-en/products/women/clothes-dresses/

Sorry for some weird English usage.
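[Editor's note: the immediate cause is that `soup` is built from the `requests.get(url)` response, which is fetched once before any clicks, so it can never contain the extra products; the parse has to happen on `driver.page_source` after the clicks. A minimal, runnable sketch of that pattern (the `FakeDriver` class and its markup are stand-ins so the example works offline; in the real script you would use the actual `driver`):]

```python
from bs4 import BeautifulSoup

# Stand-in for the Selenium driver so this sketch runs offline; in the real
# script, use the actual `driver` object after the "Load more" clicks.
class FakeDriver:
    page_source = (
        '<a class="product-list-item" href="/p/1">'
        '<p class="product-list-item-title">BRAND A</p></a>'
        '<a class="product-list-item" href="/p/2">'
        '<p class="product-list-item-title">BRAND B</p></a>'
    )

driver = FakeDriver()

# Parse the *live* DOM (after the clicks), not the old requests.get() response.
soup = BeautifulSoup(driver.page_source, "html.parser")
names = [
    a.find("p", class_="product-list-item-title").text
    for a in soup.find_all("a", class_="product-list-item")
]
print(names)  # ['BRAND A', 'BRAND B']
```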

CodePudding user response:

You can simulate the Ajax calls with the requests module to get the data directly, without Selenium (beware, there are 12k products):

import requests
from bs4 import BeautifulSoup


url = "https://zadaa.co/de-en/products/women/clothes-dresses/"
api_url = "https://zadaa.co/wp-admin/admin-ajax.php"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

payload = {
    "action": "get_more_products",
    "lang": "de-en",
    "security": "05ef973f4c",
    "query_id": soup.select_one("[data-query-id]")["data-query-id"],
    "offset": 0,
}


while True:
    data = requests.post(api_url, data=payload).json()
    if not data["success"]:
        break

    soup = BeautifulSoup(data["data"], "html.parser")

    for i in soup.select(".product-list-item"):
        print(i.select_one(".product-list-item-title").text)
        print(i["href"])
        print("-" * 80)

    payload["offset"] += 36

Prints:

...

CITY GIRL PARIS
https://zadaa.co/de-en/products/women/clothes-dresses/city-girl-paris/3735824/
--------------------------------------------------------------------------------
ZAFUL
https://zadaa.co/de-en/products/women/clothes-dresses/zaful/3735781/
--------------------------------------------------------------------------------
NKD
https://zadaa.co/de-en/products/women/clothes-dresses/nkd/3735768/
--------------------------------------------------------------------------------
GREAT RUMORS
https://zadaa.co/de-en/products/women/clothes-dresses/great-rumors/3735762/
--------------------------------------------------------------------------------

...and so on.
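[Editor's note: to get the CSV the question was after, the same parsing inside the loop can feed a pandas DataFrame instead of printing. A sketch on an offline sample shaped like one admin-ajax.php response (per the answer, the "data" field carries the next batch of product cards as an HTML fragment; the product names and URLs below are illustrative):]

```python
import pandas as pd
from bs4 import BeautifulSoup

# Offline sample shaped like one response from the loop above; the real
# "data" field holds the next 36 product cards as an HTML fragment.
sample = {
    "success": True,
    "data": (
        '<a class="product-list-item" href="https://zadaa.co/p/1">'
        '<p class="product-list-item-title">CITY GIRL PARIS</p></a>'
        '<a class="product-list-item" href="https://zadaa.co/p/2">'
        '<p class="product-list-item-title">ZAFUL</p></a>'
    ),
}

rows = []
soup = BeautifulSoup(sample["data"], "html.parser")
for a in soup.select(".product-list-item"):
    rows.append({
        "name": a.select_one(".product-list-item-title").text,
        "url": a["href"],
    })

df = pd.DataFrame(rows)
print(df)
# df.to_csv("zadaa.csv", index=False)  # as in the original script
```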