data only alternately gets fetched properly (inconsistently fetched) from a website

Time:11-24

I'm trying to scrape data from a website. Here is the code I used.

These are the modules I imported:

import bs4
import pandas as pd
import numpy as np
import random
import requests
from lxml import etree
import time
from tqdm.notebook import tqdm

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

This part collects the URL of each target product:

driver = webdriver.Chrome(ChromeDriverManager().install())
urls = []  # collected product links

for page in tqdm(range(5, 10)):
    driver.get("https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page=" + str(page) + "&sortBy=pop")
    
    skincare = driver.find_elements(By.XPATH, '//div[@]//a[@data-sqe="link"]')

    for _skincare in tqdm(skincare):
        urls.append({"url":_skincare.get_attribute('href')})
driver.quit()

The URLs were fetched successfully. Here is what I did next:

data_final = pd.DataFrame(urls)

driver = webdriver.Chrome(ChromeDriverManager().install())
skincares = []

for product in tqdm(data_final["url"]):
    driver.get(product)
    try:
        company = driver.find_element(By.XPATH,"//div[@class='CKGyuW']//div[@class='_1Yaflp page-product__shop']//div[@class='_1YY3XU']//div[@class='zYQ1eS']//div[@class='_3LoNDM']").text
    except:
        company = 'none'
    try:
        product_name = driver.find_element(By.XPATH,"//div[@class='flex flex-auto eTjGTe']//div[@class='flex-auto flex-column  _1Kkkb-']//div[@class='_2rQP1z']//span").text
    except:
        product_name = 'none'
    try:
        rating = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB _14izon']").text
    except:
        rating = 'none'
    try:
        number_of_ratings = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB']").text
    except:
        number_of_ratings = 'none'
    try:
        sold = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3EOMd6']//div[@class='HmRxgn']").text
    except:
        sold = 'none'
    try:
        price = driver.find_element(By.XPATH,"//div[@class='_2Shl1j']").text
    except:
        price = 'none'
    try:
        description = driver.find_element(By.XPATH,"//div[@class='_1MqcWX']//p[@class='_2jrvqA']").text
    except:
        description = 'none'
        
    
    skincares.append({
        "url": product,
        "company": company,
        "product name": product_name,
        "rating": rating,
        "number of ratings": number_of_ratings,
        "sold": sold,
        "price": price,
        "description": description,

        })
    time.sleep(5)

I added time.sleep(x) to avoid getting blocked, and I tried x = 1, 1.5, 2, 5, and 15. The results of the code above were not consistent. Calling

skincares_data = pd.DataFrame(skincares)
skincares_data

I get this: [screenshot of the resulting DataFrame]

Many of the fields are blank or were not fetched properly. One thing, though: if I rerun the code, I get a different set of data, in which some of the fields that were blank now have data, and some that were properly fetched are now blank. Running it yet again gives the same problem.

I think being "blocked" by the website isn't the problem here (I only added time.sleep() to make sure).

Any comments?

I tried to get data from a website. I successfully got the URLs, but the details of each product are not fetched properly. There is a lot of blank data, and from run to run the same fields alternate between blank and properly fetched.

CodePudding user response:

The page is loaded dynamically as you scroll down. The following code should solve your issue:

[..]
wait = WebDriverWait(driver, 15)
url='https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page=1&sortBy=pop'
driver.get(url)
rows= wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "shopee-search-item-result__item")]')))
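for r in rows:
    r.location_once_scrolled_into_view  # reading this property scrolls the row into view, which triggers lazy loading
time.sleep(5)  # give the lazily loaded items time to render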
products = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@data-sqe="item"]')))
for p in products:
    name = p.find_element(By.XPATH, './/div[@data-sqe="name"]').text.strip()
    some_id = p.find_element(By.XPATH, './/a[@data-sqe="link"]').get_attribute('href').split('?sp_atk=')[0].split('-i.')[1]
    print(name, some_id)

All items will be printed in terminal:

ORIG M.Q. Cosmetics MACAROON LIP THERAPY LIPBALM WITH SPATULA | MQ
wholesale 10092844.9115684791
Magic Lip Therapy Balm in 10g jar (FREE Spatula) Rebranding NO STICKER! 286498185.11511633880
BIOAQUA COLLAGEN Nourish Lips Membrane Moisturizing Lip Mask moisture nourishing skin care soft 295464315.8585504678
Lip therapy Cosmetic Potion lipbalm
₱5 off
Free Gift 11055729.11663828134
VASELINE Rosy Lip Stick 4.8g 92328166.8130605004
Collagen Crystal lip mask lips plump gel personal care hydrating lip whitening a smacker wrinkle gel 386726777.2925165359
blk cosmetics fresh lip scrub coco crush 62677292.5532509493
[...]
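
The same reasoning probably applies to the product detail pages in the question: if a field has not rendered yet when find_element runs, the bare except silently turns it into 'none', which would explain why different fields go blank on each run. Below is a minimal sketch of waiting for each field separately instead of sleeping for a fixed time. The names poll_for_text and scrape_fields are hypothetical helpers made up for this illustration, not part of Selenium, and the loop is a hand-rolled polling loop kept free of Selenium imports so it can be tried on its own. In a real script, WebDriverWait with expected_conditions (as used above) does the same job.

```python
import time


def poll_for_text(find_text, timeout=10.0, interval=0.5):
    """Call find_text() until it returns non-empty text or the timeout elapses.

    find_text is any zero-argument callable. Exceptions it raises are treated
    as "not rendered yet". Returns None if nothing showed up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            text = find_text()
            if text and text.strip():
                return text.strip()
        except Exception:
            pass  # element not in the DOM yet; try again
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


def scrape_fields(find_by_xpath, xpaths, timeout=10.0, interval=0.5):
    """Collect one value per field, waiting for each field independently.

    find_by_xpath takes an XPath string and returns the element text;
    xpaths maps a field name to its XPath.
    """
    return {
        field: poll_for_text(lambda xp=xp: find_by_xpath(xp), timeout, interval)
        for field, xp in xpaths.items()
    }
```

With a live driver, this could be called as scrape_fields(lambda xp: driver.find_element(By.XPATH, xp).text, {"price": "//div[@class='_2Shl1j']"}) after driver.get(product). A field that never appears comes back as None after the timeout, so a genuinely missing value can be told apart from one that was simply read too early.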

Selenium documentation can be found here
