Scraping - Cannot identify product class

Time:05-10

Good afternoon all,

I have been trying to develop a scraper for this specific page.

I am trying to extract product titles and prices.

The code is as follows:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse

website = 'https://www.thewhiskyexchange.com/c/339/rum'
response = requests.get(website)
response.status_code
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find_all('li',{'product-grid__item'})

If I run len(results), I get 24.

However, when I actually index into the results (results[0]), I only get one item returned:

<li ><a  href="/p/63818/bumbu-the-original-rum-glass-pack" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '63818 : Bumbu The Original Rum / Glass Pack'])" title=" Bumbu The Original Rum Glass Pack"><div ><img alt="Bumbu The Original Rum Glass Pack"  height="4" loading="lazy" src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" width="3"/></div><div ><p > Bumbu The Original Rum<span >Glass Pack</span></p><p > 70cl / 40% </p></div><div ><p > £39.95 </p><p > (£57.07 per litre) </p></div></a></li>

My question is: am I looking at the right class? I tried other classes, but they don't seem to work either. Or is there a problem with the code?

(I should say I am trying to teach myself how to code, so I wouldn't be surprised if something is missing.)

CodePudding user response:

Everything is OK. results is actually a list-like object (a bs4 ResultSet), which means the search soup.find_all('li',{'product-grid__item'}) matched many elements, so with results[0] you are accessing only the first element of the list. You can call print(results) to see all of the elements, or use a for loop:

for result in results:
  print(result) 
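
To make the list behaviour concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live site. The markup is hypothetical (three made-up items), and the product-card__name class is an assumption borrowed from the selectors used elsewhere in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: three grid items.
html = """
<ul>
  <li class="product-grid__item"><p class="product-card__name">Rum A</p></li>
  <li class="product-grid__item"><p class="product-card__name">Rum B</p></li>
  <li class="product-grid__item"><p class="product-card__name">Rum C</p></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("li", class_="product-grid__item")

count = len(results)   # number of matching <li> elements
first = results[0]     # a single Tag, not the whole list

# Loop over every match to pull one value out of each item.
names = [li.select_one(".product-card__name").get_text(strip=True) for li in results]

print(count)
print(first.name)
print(names)
```

So len(results) reports how many items matched, results[0] is just one of them, and a loop (or list comprehension) is what visits them all.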

CodePudding user response:

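The product title is the text node that comes immediately after the opening tag of the name element, before the nested <span>. To get that text node's value, you can call the .find(text=True) method on the element. The price (here, the per-litre unit price) can be grabbed the same way. Now it works: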

from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse

website = 'https://www.thewhiskyexchange.com/c/339/rum'
response = requests.get(website)
response.status_code
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find_all('li',{'product-grid__item'})

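for result in results:
    # First text node of the name element, i.e. the title before the nested <span>
    title = result.select_one('.product-card__name').find(text=True)
    print(title)
    try:
        price = result.select_one('.product-card__unit-price').find(text=True).replace('(','').replace(')','')
        print(price)
    except AttributeError:
        # Some items have no unit price, so select_one() returns None
        pass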

Output:

Bumbu The Original Rum
 £57.07 per litre 
 Kraken Black Spiced
 £54.64 per litre
 Kraken Black Roast Coffee Rum
 £38.21 per litre
 Doorly's 14 Year Old Rum
 £87.79 per litre
 Admiral Vernon's Old J Spiced Tiki Fire Rum
 £59.93 per litre
 Ron Zacapa Centenario Sistema Solera 23 Rum
 £78.50 per litre
 Old Monk 7 Year Old Rum
 £35.64 per litre 
 Diplomatico Reserva Exclusiva Rum
 £64.21 per litre
 Pusser's Select Aged 151 Navy Rum
 £69.93 per litre
 Diplomatico Reserva Exclusiva Rum
 £58.50 per litre
 El Dorado Rum 15 Year Old
 £78.50 per litre
 Plantation Extra Old Barbados Rum
 £77.50 per litre
 Captain Morgan Black Spiced
 Doorly's XO Rum
 £53.50 per litre 
 Mount Gay XO Triple Cask Blend
 £76.79 per litre
 Diplomatico Reserva Exclusiva Rum
 £58.50 per litre
 Plantation Barbados 5 Year Old Signature Blend Rum
 £44.64 per litre
 Worthy Park Single Estate Reserve
 £69.93 per litre
 Pusser's Blue Label British Navy Rum
 £39.93 per litre
 Ron Zacapa Centenario XO Rum Solera Gran Reserva Especial
 £150 per litre
 Havana Club 3 Year Old Rum
 £30.64 per litre 
 Santa Teresa 1796 Rum
 £74.93 per litre
 Eminente Reserva 7 Year Old
 £64.93 per litre
 Bumbu The Original Rum
 £48.21 per litre
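
The output above still carries stray whitespace, and one item (Captain Morgan Black Spiced) has no unit price. As a rough sketch of one way to tidy this up, the snippet below strips the title, checks for a missing price element explicitly, and converts the price to a number. It runs on hypothetical inline markup modelled on the <li> quoted in the question; the class names are assumptions taken from the selectors in the code above, and parse_item is an illustrative helper, not part of any library. It uses find(string=True), the newer spelling of find(text=True).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the <li> quoted in the question. The class
# names are assumptions taken from the selectors used in the answer's code.
html = """
<li class="product-grid__item">
  <p class="product-card__name"> Bumbu The Original Rum<span>Glass Pack</span></p>
  <p class="product-card__unit-price"> (£57.07 per litre) </p>
</li>
<li class="product-grid__item">
  <p class="product-card__name"> Captain Morgan Black Spiced</p>
</li>
"""


def parse_item(item):
    """Return (title, price_per_litre) for one grid item; price may be None."""
    name_tag = item.select_one(".product-card__name")
    # find(string=True) returns the first text node: the text before <span>.
    title = name_tag.find(string=True).strip()

    price_tag = item.select_one(".product-card__unit-price")
    if price_tag is None:
        # Explicit check for a missing price instead of a bare except.
        return title, None

    # Pull the numeric part out of text such as "(£57.07 per litre)".
    match = re.search(r"£(\d+(?:\.\d+)?)", price_tag.get_text())
    price = float(match.group(1)) if match else None
    return title, price


soup = BeautifulSoup(html, "html.parser")
rows = [parse_item(li) for li in soup.find_all("li", class_="product-grid__item")]
print(rows)
```

Returning None for a missing price keeps every product in the result, which makes it easier to load the rows into a table later.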