Can't seem to scrape specific information from webpage?

Time:05-17

I'm attempting to scrape some information for each item displayed on the following page: https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&

However, I can't seem to access the item information. The information I'm after is the name and link of each product. For the first item, for example, both are contained in:

<a aria-hidden="true" tabindex="-1" id="WC_CatalogEntryDBThumbnailDisplayJSPF_3074457345616901168_link_9b" href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New Spirits&amp;categoryType=Spirits&amp;fromURL=/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&parent_category_rn=&sortBy=0&searchSource=E&pageView=&beginIndex=0">Woodford Reserve Master Collection Five Malt Stouted Mash</a>

So the information I'm trying to scrape is:

Woodford Reserve Master Collection Five Malt Stouted Mash

and

/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New Spirits&amp;categoryType=Spirits&amp;fromURL=/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&parent_category_rn=&sortBy=0&searchSource=E&pageView=&beginIndex=0

I want to do this for every item on the page. I'm definitely connecting to the page, yet for some reason the loop over soup.select('a.catalog_item_name') never yields anything. Below is a simplified version of my script, in which I try to collect every link with the class catalog_item_name:

import random

import requests
from bs4 import BeautifulSoup


user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
# Pick a random user agent and send it with the request
user_agent = random.choice(user_agent_list)
headers = {'User-Agent': user_agent}

url = 'https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,features="html.parser")
link = []

for product in soup.select('a.catalog_item_name'):
    link.append(product)

print(link)

Any help would be greatly appreciated!

Edit: Tested the script with two other websites and it works just fine. There must be something about the site which is throwing it off?

Answer:

I guess the best approach here is to inspect the network traffic in the browser's developer tools and query the API directly. For the URL above, for example, the page sends a POST request to an endpoint at https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CategoryProductsListingView.
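
This would also explain the empty list in the question: presumably the products are filled in by that request after the page loads, so the HTML that requests.get() receives never contains the a.catalog_item_name links. A quick way to check this kind of thing is to count selector matches in whatever HTML you actually received. The sketch below uses two made-up HTML strings as stand-ins (they are assumptions for illustration, not the site's real markup), and count_matches is a hypothetical helper name:

```python
from bs4 import BeautifulSoup


def count_matches(html: str, selector: str) -> int:
    """Return how many elements in `html` match the CSS `selector`."""
    return len(BeautifulSoup(html, 'html.parser').select(selector))


# Hypothetical stand-in for the initial page: an empty container that a
# script would fill in later.
static_page = (
    '<html><body><div id="product_list"></div>'
    '<script src="listing.js"></script></body></html>'
)

# Hypothetical stand-in for the fragment returned by the follow-up request.
ajax_fragment = '<div><a class="catalog_item_name" href="/p/1">Item</a></div>'

static_count = count_matches(static_page, 'a.catalog_item_name')
ajax_count = count_matches(ajax_fragment, 'a.catalog_item_name')
print(static_count, ajax_count)
```

If the count is zero for response.text from the original script but the links are visible in the browser, the content is most likely loaded by a separate request.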

That endpoint can be queried to get the list of products, e.g.:

from bs4 import BeautifulSoup
import requests
import urllib.parse

base_url = 'https://www.finewineandgoodspirits.com'
path = '/webapp/wcs/stores/servlet/CategoryProductsListingView?sType=SimpleSearch&resultsPerPage=15&sortBy=0&disableProductCompare=false&ajaxStoreImageDir=/wcsstore/WineandSpirits/&variety=New Spirits&categoryType=Spirits&ddkey=ProductListingView'
params = {
    'storeId': '10051',
    'categoryId': '1351370',
    'searchType': '1002'
}

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'some super fancy browser',
}

# Append the remaining query parameters to the path and POST to the endpoint
request_url = base_url + path + '&' + urllib.parse.urlencode(params)
response = requests.post(request_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Now extract the content from the soup, e.g. as you did above
product_links: list[str] = [base_url + a['href'] for a in soup.select('a.catalog_item_name')]
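
Since the question asks for both the name and the link of each product, here is a minimal sketch of pulling out the two together. It runs on a small sample fragment rather than the live response: the sample HTML is an assumption modeled on the anchor quoted in the question (with the catalog_item_name class added and the href shortened, and the second entry entirely made up), and extract_products is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.finewineandgoodspirits.com'

# Sample fragment for illustration only; the real response may differ.
SAMPLE_HTML = '''
<div class="product">
  <a class="catalog_item_name" href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168">Woodford Reserve Master Collection Five Malt Stouted Mash</a>
</div>
<div class="product">
  <a class="catalog_item_name" href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=1">  Example Product  </a>
</div>
'''


def extract_products(html: str, base_url: str = BASE_URL) -> list[tuple[str, str]]:
    """Return (name, absolute_url) pairs for every product link in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        # get_text(strip=True) trims surrounding whitespace from the name;
        # urljoin turns the site-relative href into an absolute URL.
        (a.get_text(strip=True), urljoin(base_url, a['href']))
        for a in soup.select('a.catalog_item_name[href]')
    ]


products = extract_products(SAMPLE_HTML)
for name, link in products:
    print(name, link)
```

With the real response you would call extract_products(response.text) instead. The parser already decodes the &amp; entities in the href, so the resulting links can be requested as they are.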