BeautifulSoup Amazon Product Detail - CodePudding

I can't scrape the HTML of the "Product Details" section (you'll find it by scrolling down the page) using requests or requests_html. find_all returns an empty result. Any help?

from bs4 import BeautifulSoup
from requests import session
from requests_html import HTMLSession

s = HTMLSession()
#s = session()
r = s.get("https://www.amazon.com/dp/B094HWN66Y")
soup = BeautifulSoup(r.text, 'html.parser')
len(soup.find_all("div", {"id":"detailBulletsWrapper_feature_div"}))

CodePudding user response:

Sending a browser-like User-Agent header and a session cookie with the request lets you collect the product details as key/value pairs:

Code:

from bs4 import BeautifulSoup
import requests

# A session cookie and a browser-like User-Agent, so the request looks less like a bot
cookies = {'session': '131-1062572-6801905'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}

r = requests.get("https://www.amazon.com/dp/B094HWN66Y", headers=headers, cookies=cookies)
print(r)  # <Response [200]> on success
soup = BeautifulSoup(r.text, 'lxml')

# Both selectors start from the same detail-bullet list
bullets = 'ul.a-unordered-list.a-nostyle.a-vertical.a-spacing-none.detail-bullet-list > li > span'

# Labels: the bold span of each bullet, minus directional marks, colon and newlines
key = [x.get_text(strip=True).replace('\u200f\n', '').replace('\u200e', '').replace(':\n', '').replace('\n', '').strip()
       for x in soup.select(bullets + ' > span.a-text-bold')][:13]
#print(key)

# Values: the second span of each bullet
value = [x.get_text(strip=True) for x in soup.select(bullets + ' > span:nth-child(2)')]
#print(value)

# Pair each label with its value
product_details = dict(zip(key, value))
print(product_details)

Output:

{'ASIN': 'B094HWN66Y', 'Publisher': 'Boldwood Books (September 7, 2021)', 'Publication date': 'September 7, 2021', 'Language': 'English', 'File size': '1883 KB', 'Text-to-Speech': 'Enabled', 'Screen Reader': 'Supported', 'Enhanced typesetting': 'Enabled', 'X-Ray': 'Enabled', 'Word Wise': 'Enabled', 'Print length': '332 pages', 'Page numbers source ISBN': '1800487622', 'Lending': 'Not Enabled'}
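
Zipping two separately selected lists can misalign labels and values if one bullet lacks either span. A possible variation, sketched here, reads the label and value from the same bullet instead. The SAMPLE_HTML fragment and the parse_detail_bullets helper are hypothetical: the fragment is only modeled on the selectors used above, and the real Amazon markup may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the selectors above; not copied from Amazon.
SAMPLE_HTML = """
<div id="detailBulletsWrapper_feature_div">
  <ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
    <li><span class="a-list-item">
      <span class="a-text-bold">ASIN \u200f : \u200e</span>
      <span>B094HWN66Y</span>
    </span></li>
    <li><span class="a-list-item">
      <span class="a-text-bold">Language \u200f : \u200e</span>
      <span>English</span>
    </span></li>
  </ul>
</div>
"""


def parse_detail_bullets(html: str) -> dict:
    """Return {label: value} for each bullet that has both a label and a value."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for item in soup.select("ul.detail-bullet-list > li > span"):
        label = item.select_one("span.a-text-bold")
        value = label.find_next_sibling("span") if label else None
        if label is None or value is None:
            continue  # skip bullets that do not follow the label/value layout
        # Remove the directional marks, then the trailing colon and whitespace
        key = label.get_text().replace("\u200f", "").replace("\u200e", "")
        key = key.strip().rstrip(":").strip()
        details[key] = value.get_text(strip=True)
    return details


if __name__ == "__main__":
    print(parse_detail_bullets(SAMPLE_HTML))
```

With a live page you would pass r.text instead of SAMPLE_HTML, assuming the page still uses this list layout.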

CodePudding user response:

This is an example of how to scrape the product title using bs4 and requests; it can easily be extended to get other information about the product.

Your code doesn't work because the request has no headers, so Amazon realises you're a bot and refuses to let you scrape the site. You can see this in the response, which comes back as <Response [503]>, and the reason is explained in r.text.
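
To catch this early instead of getting an empty find_all result, you could check the status before parsing. This is only a sketch: soup_or_raise and fetch_soup are hypothetical helper names, and the timeout value is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def soup_or_raise(response: requests.Response) -> BeautifulSoup:
    """Parse the response body only if the server returned a success status."""
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so a 503 block page is never handed to the parser
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def fetch_soup(url: str, headers: dict) -> BeautifulSoup:
    """Fetch a page with the given headers and return the parsed document."""
    response = requests.get(url, headers=headers, timeout=10)
    return soup_or_raise(response)
```

Calling fetch_soup with no useful headers would then fail with an HTTPError that mentions the 503, rather than silently returning a page without the elements you are looking for.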

I believe Amazon has an API for this (which they'd probably prefer you to use), but scraping like this should be fine for small-scale use.

import requests
import bs4

# Amazon doesn't like being scraped, but these headers should keep a small number of requests from being noticed
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'}


def main():
    url = "https://www.amazon.com/dp/B094HWN66Y"
    title = get_title(url)
    print("The title of %s is: %s" % (url, title))


def get_title(url: str) -> str:
    """Returns the title of the Amazon product."""
    # The request
    r = requests.get(url, headers=HEADERS)

    # Parse the content
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    # get_text(strip=True) removes the whitespace surrounding the title text
    title = soup.find("span", attrs={"id": 'productTitle'}).get_text(strip=True)

    return title


if __name__ == "__main__":
    main()

Output: The title of https://www.amazon.com/dp/B094HWN66Y is: Will They, Won't They?