Boots.com - Python Web Scraping only returned result of first page

Time:07-18

I am trying to scrape data from the Boots.com skincare category page: https://www.boots.com/beauty/skincare/skincare-all-skincare

There are 122 pages of skincare products in total.

I have successfully scraped the data on the first page using requests and BeautifulSoup. Here is the code:

import requests
from bs4 import BeautifulSoup

productlinks = []

r = requests.get('https://www.boots.com/beauty/skincare/skincare-all-skincare')
soup = BeautifulSoup(r.content, 'lxml')

productlist = soup.find_all('div', class_ = 'product_name')
    
for item in productlist:
    for link in item.find_all('a',href = True):
        productlinks.append(link['href'])

However, when I tried to extend the scraper to other pages, it only returned the results from the first page.

  1. I tried using a loop, but it repeated the same product URLs. The following code gave me 48 results, but they were duplicates of the first page's 24 items.

    
    productlinks = []
    
    for i in range (24,72,24):
        page = f'https://www.boots.com/beauty/skincare/skincare-all-skincare#facet:&productBeginIndex:{i}&orderBy:&pageView:grid&minPrice:&maxPrice:&pageSize:&'
        soup = BeautifulSoup(r.content,'lxml')
        productlist = soup.find_all('div', class_ = 'product_name')
    
        for item in productlist:
            for link in item.find_all('a',href = True):
                productlinks.append(link['href'])
    
    
  2. I tried using the URL of the second page, but it still returned data from the first page:

    productlinks = []
    
    r = requests.get('https://www.boots.com/beauty/skincare/skincare-all-skincare#facet:&productBeginIndex:24&orderBy:&pageView:grid&minPrice:&maxPrice:&pageSize:&')
    soup = BeautifulSoup(r.content,'lxml')
    
    productlist = soup.find_all('div', class_ = 'product_name')
    
    for item in productlist:
        for link in item.find_all('a',href = True):
            productlinks.append(link['href'])
    
    

I've searched for similar questions, but most websites use page=i in the URL to identify the page, whereas Boots.com uses productBeginIndex:{i}.

I am not sure whether this is what causes the issue.

CodePudding user response:

If you open the Network tab in Chrome's developer tools, you will notice that switching pages triggers a POST request to this URL:

https://www.boots.com/ProductListingViewRedesign?ajaxStoreImageDir=/wcsstore/eBootsStorefrontAssetStore/&searchType=1000&advancedSearch=&cm_route2Page=&filterTerm=&storeId=11352&cm_pagename=&manufacturer=&sType=SimpleSearch&metaData=&catalogId=28501&searchTerm=&resultsPerPage=24&filterFacet=&resultCatEntryType=&gridPosition=&emsName=&disableProductCompare=false&langId=-1&facet=&categoryId=2300180

This POST request has a payload, which you can also find in the Network tab. Send a POST request with the correct headers and payload, and you will get the expected results. If you have difficulties at any step, post back your attempts, and you will receive further help.
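
As for why the URLs in the question always return page one: everything after `#` is a URL fragment, which the browser keeps for itself and does not send to the server, so `requests.get` fetches the same first-page document each time. A minimal sketch using only the standard library shows which part of the URL a server can see (the helper name `server_visible_url` is made up for illustration):

```python
from urllib.parse import urlsplit, urlunsplit


def server_visible_url(url: str) -> str:
    """Return the URL with its fragment removed, i.e. the part sent to the server."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


page1 = "https://www.boots.com/beauty/skincare/skincare-all-skincare"
page2 = page1 + "#facet:&productBeginIndex:24&orderBy:&pageView:grid&minPrice:&maxPrice:&pageSize:&"

# The paging information lives entirely in the fragment...
print(urlsplit(page2).fragment)
# ...so both URLs resolve to the same request.
print(server_visible_url(page2) == page1)
```

The paging on the site is therefore done by JavaScript reading the fragment and issuing its own request, which is the POST described above.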

CodePudding user response:

The data is loaded from a separate endpoint via an AJAX request, and the response comes back as HTML.

Script:

import requests
from bs4 import BeautifulSoup

# Form payload copied from the browser's Network tab, with the page number and
# the start offset turned into placeholders.
data = 'contentBeginIndex=0&pageNo={p}&productBeginIndex={begin}&beginIndex={begin}&orderBy=&facetId=&pageView=grid&resultType=products&orderByContent=&searchTerm=&facet=&facetLimit=&minPrice=&maxPrice=&pageSize=&prem=&article=&storeId=11352&catalogId=28501&langId=-1&objectId=_6_3074457345618283155_3074457345619405964&requesttype=ajax'
headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'content-type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'referer': 'https://www.boots.com/beauty/skincare/skincare-all-skincare',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

api_url = 'https://www.boots.com/ProductListingViewRedesign?ajaxStoreImageDir=/wcsstore/eBootsStorefrontAssetStore/&searchType=1000&advancedSearch=&cm_route2Page=&filterTerm=&storeId=11352&cm_pagename=&manufacturer=&sType=SimpleSearch&metaData=&catalogId=28501&searchTerm=&resultsPerPage=24&filterFacet=&resultCatEntryType=&gridPosition=&emsName=&disableProductCompare=false&langId=-1&facet=&categoryId=2300180'

productlinks = []
for p in range(1, 10):
    # Assumption: the offset advances by 24 per page (page 2 starts at 24,
    # as in the productBeginIndex:24 URL from the question).
    begin = (p - 1) * 24
    # The placeholders are in the payload, so format `data`, not `api_url`.
    r = requests.post(api_url, headers=headers, data=data.format(p=p, begin=begin))
    # print(r)

    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('div', class_='product_name')

    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])
print(productlinks)

Output:

['https://www.boots.com/origins-ginzing™-glow-boosting-mask-75ml-10315637', 'https://www.boots.com/nivea-cherry-shine-caring-lip-balm-5-5ml-10307437', 'https://www.boots.com/sanctuary-spa-signature-collection-hand-wash-antibacterial-refill-500ml-10314694', 'https://www.boots.com/cetraben-natural-oatmeal-cream-475g-10311057', 'https://www.boots.com/beauty/new-in-beauty-skincare/boots-tea-tree-and-witch-hazel-charcoal-face-scrub-150ml-10296405', 'https://www.boots.com/beauty/new-in-beauty-skincare/boots-tea-tree-and-witch-hazel-charcoal-facial-wash-150ml-10296406', 'https://www.boots.com/boots-tea-tree-and-witch-hazel-nose-pore-strips-6-strips-10296409', 'https://www.boots.com/boots-tea-tree-and-witch-hazel-clarifying-plastic-free-sheet-mask-19g-10297713', 'https://www.boots.com/ProductDisplay?errorViewName=ProductDisplayErrorView&storeId=11352&urlLangId=&productId=2712630&urlRequestType=Base&langId=-1&catalogId=28501', 
'https://www.boots.com/dior-micellar-water-200ml-10313501', 'https://www.boots.com/boots-vitamin-c-brightening-sleeping-mask-50ml-10304717', 'https://www.boots.com/beauty/skincare/vegan-skincare-products/boots-tea-tree-and-witch-hazel-exfoliating-pads-60s-10316494', 'https://www.boots.com/sanctuary-spa-signature-collection-body-butter-75ml-10314705', 'https://www.boots.com/liz-earle-your-daily-routine-with-superskin™-moisturiser-unfragranced-10316108', 'https://www.boots.com/bobbi-brown-vitamin-enriched-face-base-50ml-10300634', 'https://www.boots.com/boots-tea-tree-and-witch-hazel-day-and-night-spot-wand-10300997', 'https://www.boots.com/boots-dermacare-rosacea-treatment-serum-25ml-10308396', 'https://www.boots.com/liz-earle-your-daily-routine-with-skin-repair™-gel-cream-10316104', 'https://www.boots.com/liz-earle-your-daily-routine-with-skin-repair™-light-cream-10316105', 'https://www.boots.com/liz-earle-your-daily-routine-with-skin-repair™-rich-cream-10316106', 'https://www.boots.com/sol-de-janeiro-brazilian-4play-moisturizing-shower-cream-gel-1000ml-10318265', 'https://www.boots.com/sol-de-janeiro-rio-deo-aluminum-free-deodorant-57g-10318269', 'https://www.boots.com/sol-de-janeiro-coco-cabana™-moisturizing-body-cream-cleanser-90ml-10318277', 'https://www.boots.com/liz-earle-your-daily-routine-with-superskin™-moisturiser-natural-neroli-10316107']
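
The form body is easier to vary per page if it is built from a dict rather than edited by hand. The sketch below assumes, based only on `resultsPerPage=24` in the request URL and `productBeginIndex:24` in the page-two URL from the question, that `pageNo` is 1-based and that the offset advances by 24 per page. `build_payload` and `dedupe` are hypothetical helper names, not part of any library. The field names and fixed values are copied from the payload in the script.

```python
from urllib.parse import urlencode

PAGE_SIZE = 24  # from resultsPerPage=24 in the request URL


def build_payload(page_no: int, page_size: int = PAGE_SIZE) -> str:
    """Build the form-encoded body for one listing page (page_no is 1-based).

    Assumption: productBeginIndex and beginIndex are zero-based item offsets.
    """
    begin = (page_no - 1) * page_size
    fields = {
        "contentBeginIndex": 0,
        "pageNo": page_no,
        "productBeginIndex": begin,
        "beginIndex": begin,
        "orderBy": "",
        "facetId": "",
        "pageView": "grid",
        "resultType": "products",
        "orderByContent": "",
        "searchTerm": "",
        "facet": "",
        "facetLimit": "",
        "minPrice": "",
        "maxPrice": "",
        "pageSize": "",
        "prem": "",
        "article": "",
        "storeId": 11352,
        "catalogId": 28501,
        "langId": -1,
        "objectId": "_6_3074457345618283155_3074457345619405964",
        "requesttype": "ajax",
    }
    return urlencode(fields)


def dedupe(links):
    """Drop repeated links while keeping first-seen order."""
    return list(dict.fromkeys(links))


print(build_payload(2))
```

The returned string can be passed as `data=` to `requests.post` together with the headers from the script. Running the collected links through `dedupe` guards against the same product appearing on more than one page.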