Scraping products from an automatically loading website


My issue is that I am scraping products from a website that loads more products automatically as you scroll down. My scrape only returns 24 items, so my question is: what code can I use to loop over all the products at the following link, given that the URL contains nothing that indicates which page I am on?

from bs4 import BeautifulSoup
import requests
import pandas as pd

product_name = []
product_brand = []
product_price = []
product_img = []
relative_url = []

website = 'https://en-saudi.ounass.com/women/beauty/fragrance'

response = requests.get(website)

soup = BeautifulSoup(response.content, 'html.parser')

results = soup.find_all('div', {'class': 'Product-contents'})

for result in results:
    # name
    try:
        product_name.append(result.find('div', {'class': 'Product-name'}).get_text())
    except AttributeError:
        product_name.append('n/a')

    # brand
    try:
        product_brand.append(result.find('div', {'class': 'Product-brand'}).get_text())
    except AttributeError:
        product_brand.append('n/a')

    # price
    try:
        product_price.append(result.find('span', {'class': 'Product-minPrice'}).get_text())
    except AttributeError:
        product_price.append('n/a')

    # pics
    try:
        product_img.append(result.find('img', {'class': 'Product-image'}).get('data-src'))
    except AttributeError:
        product_img.append('n/a')

    # relative_url
    try:
        relative_url.append(result.find('a', {'class': 'Product-link'}).get('href'))
    except AttributeError:
        relative_url.append('n/a')
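
For reference, the lists collected above can be combined into a single table with the pandas import that is already there; the column names below are illustrative, not from the original post:

# Assemble the scraped lists into one DataFrame; column names are illustrative
df = pd.DataFrame({
    'Name': product_name,
    'Brand': product_brand,
    'Price': product_price,
    'Image': product_img,
    'Link': relative_url,
})
print(df.head())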

CodePudding user response:

You just need to use the public API. It exposes far more information than you will need, and it is also much faster than Selenium. Here is an example with the fields that were in your question:

import requests
import pandas as pd


results = []
page = 0
while True:
    # The same listing is served page by page through the site's API
    url = f"https://en-saudi.ounass.com/api/women/beauty/fragrance?sortBy=popularity-asc&p={page}&facets=0"
    hits = requests.get(url).json()['hits']
    if hits:
        page += 1
        for hit in hits:
            results.append({
                'Name': hit['analytics']['name'],
                'Brand': hit['analytics']['brand'],
                'Price': hit['price'],
                'Image': hit['_imageurl'],
                'Link': f"https://en-saudi.ounass.com/{hit['slug']}.html"
            })
    else:
        # An empty 'hits' list means there are no more pages
        break
df = pd.DataFrame(results)
print(df)

OUTPUT:

                                           Name  ...                                               Link
0           Cœur de Jardin Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-miller-harris...
1        Patchouli Intense Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-nicolai-parfu...
2            Blue Sapphire Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-boadicea-the-...
3           Ambre Vanillé Eau de Toilette, 50ml  ...  https://en-saudi.ounass.com/shop-laura-mercier...
4     Baccarat Rouge 540 Scented Body Oil, 70ml  ...  https://en-saudi.ounass.com/shop-maison-franci...
...                                         ...  ...                                                ...
2368               Olene Eau de Toilette, 100ml  ...  https://en-saudi.ounass.com/shop-diptyque-olen...
2369  Magnolia Nobile Leather Purse Spray, 20ml  ...  https://en-saudi.ounass.com/shop-acqua-di-parm...
2370           Eau du Soir Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-sisley-eau-du...
2371              Yvresse Eau de Toilette, 80ml  ...  https://en-saudi.ounass.com/shop-ysl-beauty-yv...
2372               Lalibela Eau de Parfum, 75ml  ...  https://en-saudi.ounass.com/shop-memo-paris-la...
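
If you want to keep the result, pandas can write the assembled frame straight to disk; the filename here is just an example:

# Save the scraped table for later use; the filename is illustrative
df.to_csv('ounass_fragrance.csv', index=False)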

CodePudding user response:

You will need Selenium to do this. Selenium opens a web page (using a browser driver) and performs the actions you specify, like scrolling.

The code itself will depend on the website structure, but here are the main steps to get you started:

  1. Download the Chrome or Firefox driver

  2. Import Selenium

  3. Configure Selenium to use the driver

  4. Open the website

  5. Find the element with the scroll bar and use the arrow-down key to scroll down.

  6. Get the information you need from the loaded products. Use Python's sleep to make sure everything is loaded, then scroll again as often as you need (an alternative scrolling approach is sketched after the example code below).

     # Import
     from time import sleep

     from selenium import webdriver
     from selenium.webdriver.common.keys import Keys

     # Open a driver (using Firefox in the example)
     profile = webdriver.FirefoxProfile()
     profile.set_preference('intl.accept_languages', 'en-us')
     profile.update_preferences()
     driver = webdriver.Firefox(firefox_profile=profile, executable_path='executable_path')

     # Open the site
     driver.get('https://www.example.com/products')

     # Find the element with the scroll bar and scroll using the arrow-down key (10 times)
     elem = driver.find_element_by_xpath('xpath_to_element_with_scroll')
     i = 0
     while i < 10:
         elem.send_keys(Keys.ARROW_DOWN)
         sleep(0.5)  # give newly revealed products a moment to load
         i += 1

     # Here you would locate the loaded products, save them, and repeat as needed.
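
A common alternative to arrow-key scrolling, when the whole window scrolls rather than an inner element, is to jump to the bottom of the page with JavaScript and stop once the page height stops growing. A minimal sketch under that assumption (the URL is a placeholder, and the fixed wait may need tuning for the real site):

from time import sleep

from selenium import webdriver

driver = webdriver.Firefox()  # assumes geckodriver is on PATH
driver.get('https://www.example.com/products')  # placeholder URL

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # Jump to the bottom of the page to trigger the next batch of products
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)  # crude wait; tune for the site, or use explicit waits
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # no new content loaded, so we have reached the end
    last_height = new_height

# At this point driver.page_source contains all loaded products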
    