Scraping Aliexpress search page does not return all products

Time:01-04

I have the below code, which I expect to return 60 products, but instead only returns 16:

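from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))

url = 'https://www.aliexpress.com/w/wholesale-silicone-night-light.html?SearchText=silicone night light&catId=0&initiative_id=SB_20230101130255&spm=a2g0o.productlist.1000002.0&trafficChannel=main&shipFromCountry=US&g=y'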

driver.get(url)

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

product_links = []


def get_element_title(element):
    return element.select('h1[class*="manhattan--titleText--"]')[0].text


def get_product_links(soup):
    for element in soup.select('a[class*="manhattan--container--"]'):
        link = f"http:{element['href']}"
        product_links.append(link)
        print(get_element_title(element))


get_product_links(soup)

I manually checked the class name for all the products, since I thought maybe some of them have different class names in an effort to stop scraping, but they all have the same class name.


CodePudding user response:

Not all products were found because they are rendered only once you scroll to them. To trigger that, we can run some JavaScript to scroll one window height at a time until the end of the document. Also, instead of parsing the page with BeautifulSoup, you can use Selenium itself to locate the elements with driver.find_elements and a CSS selector that matches elements whose class attribute starts with the identified string pattern (i.e. use ^=).

Here is the full code:

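from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# placeholder: set this to the location of your local chromedriver binary
driver_path = '/path/to/chromedriver'


def launch_url(url):
    # create webdriver object
    chrome_srv = Service(driver_path)
    driver = webdriver.Chrome(service=chrome_srv)
    driver.get(url)
    # find doc/window height and compute page count
    doc_height = driver.execute_script("return document.body.scrollHeight")
    win_height = driver.execute_script("return window.innerHeight")
    num_pages = int(doc_height / win_height)
    print(f'doc height=>{doc_height}\tpages =>{num_pages}')
    # scroll through the document, one window height at a time
    for page in range(num_pages):
        driver.execute_script("window.scrollTo(0, arguments[0]);", win_height * (page + 1))
        print(f'scrolling to=>{win_height * (page + 1)}')
        # give the lazily loaded products time to render
        sleep(2)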

    # get all product anchor tags
    anc_elem_list = driver.find_elements(By.CSS_SELECTOR,'a[class^=manhattan--container--1lP57Ag]')
    # get all product title tags
    title_elem_list = driver.find_elements(By.CSS_SELECTOR,'h1[class^=manhattan--titleText--]')
    print(len(anc_elem_list),len(title_elem_list))
    for anc_elem,title_elem in zip(anc_elem_list,title_elem_list):
        print(anc_elem.get_attribute('href'),title_elem.text)

Launched with the given URL, this found all 60 products. An extract of the output:

>>> launch_url('https://www.aliexpress.com/w/wholesale-silicone-night-light.html...')
doc height=>5037    pages =>8
scrolling to=>622
scrolling to=>1244
scrolling to=>1866
scrolling to=>2488
scrolling to=>3110
scrolling to=>3732
scrolling to=>4354
scrolling to=>4976
60 60
https://www.aliexpress.com/item/1005001381826134.html?...&curPageLogUid=gGmfVB3DBYLJ LED Night Lamp Touch Sensor Cat Silicone Animal Light Colorful Child Holiday Gift Sleepping Creative Bedroom Desktop Decor Lamp
https://www.aliexpress.com/item/1005004984676285.html?...&curPageLogUid=uzZMOGxkhRNa Mini LED Night Light 7 Color Soft Silicone Touch Sensor Lamp Kawaii Cartoon Cute Animal Cat Nightlight Table Lamp Couple Gift
https://www.aliexpress.com/item/1005003422007432.html?...&curPageLogUid=N0iy8GSRNn8j Dog LED Night Light Touch Sensor Remote Control 16 Colors Dimmable USB Rechargeable Silicone Puppy Lamp for Children Baby Gift
...
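
The scroll targets in the answer above come from simple arithmetic on the document and window heights. Here is a minimal sketch of that calculation on its own, so it can be checked without a browser. The helper name scroll_offsets is hypothetical, and rounding up with math.ceil (rather than truncating with int as the answer does) is an assumption made here so that a trailing partial window is also visited:

```python
import math


def scroll_offsets(doc_height, win_height):
    """Return the Y offsets to scroll to, one window height at a time.

    Hypothetical helper, not part of the original answer.
    """
    if win_height <= 0:
        raise ValueError("win_height must be positive")
    # Round up so the last, partially filled window is not skipped.
    num_pages = math.ceil(doc_height / win_height)
    return [win_height * (page + 1) for page in range(num_pages)]


# Heights taken from the output extract above.
offsets = scroll_offsets(5037, 622)
print(len(offsets), offsets[0], offsets[-1])
```

Each offset could then be passed to driver.execute_script("window.scrollTo(0, arguments[0]);", offset). Since lazily loaded content can make the document grow while scrolling, it may be worth re-reading document.body.scrollHeight afterwards and repeating if it has changed.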

CodePudding user response:

Here is one way of getting that information without the overhead of Selenium: call the API endpoint directly (you can find it in the Dev tools Network tab). The JSON response, and the resulting dataframe, contain a wealth of information, including product links:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

big_df = pd.DataFrame()

headers = {
    'accept-language': 'en-US,en;q=0.9',
    'bx-v': '2.2.3',
    'content-length': '80',
    'content-type': 'application/json',
    'origin': 'https://aliexpress.ru',
    'referer': 'https://aliexpress.ru/popular/silicone-night-light.html?CatId=0&g=n&page=3&spm=a2g0o.productlist.1000002.0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

x = 1
while True:
    try:
        print('doing page', x)
        payload = '{"catId":"0","g":"n","searchText":"silicone night light","storeIds":[],"page":' + str(x) + '}'
        r = s.post('https://aliexpress.ru/aer-webapi/v1/seo/popular_search?_bx-v=2.2.3', data=payload)
        df = pd.json_normalize(r.json()['data']['productsFeed']['products'])
        big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
        x += 1
    except Exception as e:
        print('all done')
        break
print(big_df)

Result in terminal:

doing page 1
doing page 2
doing page 3
doing page 4
doing page 5
doing page 6
doing page 7
doing page 8
doing page 9
doing page 10
doing page 11
doing page 12
doing page 13
doing page 14
doing page 15
doing page 16
doing page 17
doing page 18
doing page 19
doing page 20
doing page 21
doing page 22
doing page 23
doing page 24
doing page 25
doing page 26
doing page 27
doing page 28
doing page 29
doing page 30
doing page 31
doing page 32
doing page 33
doing page 34
all done
id  imgSrc  productUrl  productTitle    storeUrl    storeTitle  tags    userTags    fullPrice   discount    finalPrice  freeDelivery    sales   salesLink   rating  ratingLink  marketingBadge  ad  hot isWished    p4p imgGallery  affTraceInfo    trace   p4p.clickUrl
0   4000808672930   //ae04.alicdn.com/kf/Hc50ce1cca85341f086761bedb89952c2h.jpg_350x350.jpg /item/4000808672930.html    Силиконовый ночсветильник с датчиком касания, перезаряжаемый по USB ночник в виде кошки, утки, прикроватная лампа с дистанционным управление... //aliexpress.ru/store/1827627   newstyle Official Store []  []  1 979,96 руб.   34  1 307,13 руб.   False       /item/4000808672930.html    4.8 /item/4000808672930.html?#feedback      False   False   False   NaN []  None    None    NaN
1   1005001870534143    //ae04.alicdn.com/kf/Sda83088b56dc4f5d91e0fa4cfd80e9a80.jpg_350x350.jpg /item/1005001870534143.html Оптовая продажа, светодиодный неоновый ночник, вывеска, настенная вывеска, ночная лампа, подарок на Рождество, день рождения, Свадебная вече... //aliexpress.ru/store/4650092   Light Us Store  []  []  1 581,92 руб.   45  869,71 руб. False       /item/1005001870534143.html 4.8 /item/1005001870534143.html?#feedback       True    False   False   NaN []  None    None    //us-click.aliexpress.com/ci_bb?ot=local&a=1279654123&e=ncYikJBnio-JDjh3Asn7-ZC3Tx1Scbeub6R3v6AkWe7ZlkgZ6NRi9Tv6IathNKgkOJaRtqgOQMqri2zhzi.u6rk0cO.smYqF2qB6qjLhmHvMFWYLitrEcr6VFIgsBEpH7Lnm6VwOwintJSeWlGBa-zqRoUgRQawPZm.nVCTYqeb7htDcAGHewTUzmXS1hs9C0soIE8-tsKNeuN.1Dzn9fwo3i83DHuWpQkQR7TKBV3AbF5WH2b6YsWCofrmS0.At3KJZZFSv6WhxjzWckAL3xxTzykNTf4wG5kWF5hoMkxz3Mx6oVaCRXb3W6PLQCh551nn0BIw2eUzboSadTJYC82wJEEId7JwIFsdJMuXzT0HRmyR6NPTZgFW32d2Dbk-B5K-I85TBu-slQbT4jlpJIRNSDnUsnCh4JyiRLGrnlzxu5t-qJyzzhXsUI9LUKFlfa96DYrxi2TLv0K1vXl4L84wQccKEX3YGJnl1zvgmN04h7XUTRsJuJDFJ0GgrMYzmZ3R8wz5.KpEB0-I2W9celn0wKCRy2RT9F1jvxHlpDN5.YxA4wN-bS4qDoQWN7pabmoUzxrJl2RU5OTaK5vBO.WYlC15pwdIXhEUjZYXPfR2jeSTKXDLyLhFmigOoDpi21tOTNxL-FzDZ3MfuSHmc0cS6vMBgVPpQAwoBoSicLu73uQC9743kbOFMGnbIOdGNesvaTgvAQ3mzBqVq7lmaEJwgApHtocsAERkGRZL9Q.jby4eg9IJ2GY8.pST87U77N4pU.ImsVywUFR5rQ-mnqggyFTmfatT9f8ZDlREjqNnTSRCsiPQ1ilb85SoyAOiilGB7no5mOklURZdj2wQNr3TvEJAGwS3Rzs6CBP.ZnV83z.AriOi7OKTKVCQEkHLR&ap=1&rp=1
2   4000251461236   //ae04.alicdn.com/kf/Hda262188b88444ee944889f86e5a1238n.jpg_350x350.jpg /item/4000251461236.html    СВЕТОДИОДНЫЙ ночник в виде кролика с сенсорным датчиком и дистанционным управлением, 9 цветов, с регулируемой яркостью, с таймером, аккумуля... //aliexpress.ru/store/5046292   STARNIGHT Official Store    []  []  2 064,71 руб.   28  1 486,90 руб.   False       /item/4000251461236.html    4.9 /item/4000251461236.html?#feedback      False   False   False   NaN []  None    None    NaN
3   32781209713 //ae04.alicdn.com/kf/Sd1863e90fc5e4c8f9daf81945cf4046au.jpg_350x350.jpg /item/32781209713.html  Силиконовый светодиодный Ночной светильник с сенсорным датчиком, 7 цветов, 2 режима //aliexpress.ru/store/1827627   newstyle Official Store []  []  1 094,84 руб.   27  799,52 руб. False       /item/32781209713.html  4.8 /item/32781209713.html?#feedback        False   False   False   NaN []  None    None    NaN
4   4000248571768   //ae04.alicdn.com/kf/H056a1b9b7b4d4ba3b34571f23578bf4b3.jpg_350x350.jpg /item/4000248571768.html    Силиконовый светодиодный ночник для собак с сенсорным датчиком, приглушаемая лампа с таймером и USB-зарядкой, прикроватная лампа для щенков, ...    //aliexpress.ru/store/5046292   STARNIGHT Official Store    []  []  2 027,04 руб.   35  1 317,41 руб.   False       /item/4000248571768.html    5.0 /item/4000248571768.html?#feedback      False   False   False   NaN []  None    None    NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2026    1005003731508193    //ae04.alicdn.com/kf/S5373d9db25ad40e49b787e63c0161cabb.jpg_350x350.jpg /item/1005003731508193.html Мини Силиконовый сенсорный Ночной светильник Snow Mountain рождественские подарки Светодиодная лампа со звуковым управлением и защитой глаз //aliexpress.ru/store/912722285 Brighting House Decoration Store    []  []  2 511,55 руб.   12  2 210,23 руб.   False       /item/1005003731508193.html 0.0 /item/1005003731508193.html?#feedback       False   False   False   None    []  None    None    NaN
2027    1005003003793655    //ae04.alicdn.com/kf/H53be514e21874c18bdf30ff1523582d2X.jpg_350x350.jpg /item/1005003003793655.html Светодиодный ночсветильник в виде дракона, силиконовая лампа с подзарядкой через USB, разноцветсветильник освещение, прикроватный столик, у...  //aliexpress.ru/store/911605883 GeekLamp Store  []  []      0   1 367,91 руб.   False       /item/1005003003793655.html 0.0 /item/1005003003793655.html?#feedback       False   False   False   None    []  None    None    NaN
2028    1005001875095192    //ae04.alicdn.com/kf/H67346c6d1866414981136105d6d6ae6cw.jpg_350x350.jpg /item/1005001875095192.html Мультяшный динозавр светодиодный светодиодная силиконовая прикроватная лампа для чтения, лампа для чтения с краном для изменения цвета, USB ... //aliexpress.ru/store/910342016 7777 Sold Store []  []  2 046,73 руб.   32  1 391,88 руб.   False       /item/1005001875095192.html 0.0 /item/1005001875095192.html?#feedback       False   False   False   None    []  None    None    NaN
2029    1005004095166342    //ae04.alicdn.com/kf/S294805be90ba42eb9aa27032ca0e4e22T.jpg_350x350.jpg /item/1005004095166342.html Силиконовый светодиодный ночсветильник в виде панды, милая мультяшная лампа с сенсорным USB-датчиком, красочная прикроватная лампа для спал...  //aliexpress.ru/store/911840429 XvenDeng Lighting Store []  []  2 574,04 руб.   39  1 569,93 руб.   False       /item/1005004095166342.html 0.0 /item/1005004095166342.html?#feedback       False   False   False   None    []  None    None    NaN
2030    1005004100590504    //ae04.alicdn.com/kf/Sd1d2b98ffaf141ed88cec0e722e5d6d7p.jpg_350x350.jpg /item/1005004100590504.html Силиконовый ночник Pat в виде жирафа, прикровасветильник ночник для спальни, светодиодный ночник для кормления ребенка, детский ночник  //aliexpress.ru/store/911840429 XvenDeng Lighting Store []  []  2 449,06 руб.   39  1 493,75 руб.   False       /item/1005004100590504.html 0.0 /item/1005004100590504.html?#feedback       False   False   False   None    []  None    None    NaN
2031 rows × 25 columns
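
Note that the productUrl values in this output are site-relative (/item/...) and the imgSrc values are protocol-relative (//ae04.alicdn.com/...). A minimal sketch of turning them into absolute links with the standard library follows. The helper name to_absolute and the choice of https://aliexpress.ru as the base are assumptions, inferred from the origin header used in the request:

```python
from urllib.parse import urljoin

# Assumption: the base is the same host the API request was sent to.
BASE_URL = "https://aliexpress.ru"


def to_absolute(url, base=BASE_URL):
    """Resolve a relative or protocol-relative URL against the base.

    Hypothetical helper; returns None for missing values (NaN, None, '').
    """
    if not isinstance(url, str) or not url:
        return None
    return urljoin(base, url)


print(to_absolute("/item/4000808672930.html"))
print(to_absolute("//aliexpress.ru/store/1827627"))
```

Applied to the dataframe, this could look like big_df['productUrl'].map(to_absolute), assuming the column names shown in the output above.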