Has anyone had success scraping Amazon using BeautifulSoup?

Time:12-17

I want to make a web scraper for Amazon.

But it looks like every piece of data comes back as None.

I searched on Google and there are many people who have made Amazon web scrapers.

Please give me some advice to solve this NoneType issue.

Here is my code:

import requests
from bs4 import BeautifulSoup

amazon_dir = requests.get("https://www.amazon.es/s?k=docking station&__mk_es_ES=ÅMÅŽÕÑ&crid=34FO3BVVCJS4V&sprefix=docking,aps,302&ref=nb_sb_ss_ts-doa-p_1_7")
amazon_soup = BeautifulSoup(amazon_dir.text, "html.parser")
product_table = amazon_soup.find("div", {"class": "sg-col-inner"})
print(product_table)

products = product_table.find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print(name, rating, price)

Thank you

CodePudding user response:

Portals may check the User-Agent header and send different HTML to different browsers or devices, and sometimes this makes it hard to find elements on the page.

But usually portals check this header to block scripts/bots.
For example, requests sends User-Agent: python-requests/2.26.0.

If I use a User-Agent header from a real browser, or at least the shorter version Mozilla/5.0, then the code works.
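As a quick check (the exact version number will differ on your machine), you can inspect the default User-Agent that requests sends, and build a browser-like replacement:

```python
import requests

# requests identifies itself in its default User-Agent header,
# which is what many sites use to detect and block scripts
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)  # e.g. python-requests/2.26.0

# overriding it with a browser-like value (even the short
# "Mozilla/5.0") is often enough to receive the normal HTML
headers = {"User-Agent": "Mozilla/5.0"}
# response = requests.get(url, headers=headers)
```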


There is another problem.

There are almost 70 matching <div ...> elements, and the one holding the products is not the first, but find() returns only the first match. You have to use find_all() and then index into the resulting list (the code below uses all_divs[3]).
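The difference is easy to see on a minimal made-up snippet: find() stops at the first match, while find_all() returns every match as a list you can index:

```python
from bs4 import BeautifulSoup

html = """
<div class="sg-col-inner">first</div>
<div class="sg-col-inner">second</div>
<div class="sg-col-inner">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching element
print(soup.find("div", {"class": "sg-col-inner"}).text)  # first

# find_all() returns all of them; index to pick a later one
all_divs = soup.find_all("div", {"class": "sg-col-inner"})
print(len(all_divs))     # 3
print(all_divs[2].text)  # third
```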


import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0',
}    

url = "https://www.amazon.es/s?k=docking station&__mk_es_ES=ÅMÅŽÕÑ&crid=34FO3BVVCJS4V&sprefix=docking,aps,302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)

print(response.text[:1000])
print('---')

amazon_soup = BeautifulSoup(response.text, "html.parser")
all_divs = amazon_soup.find_all("div", {"class": "sg-col-inner"})

print('len(all_divs):', len(all_divs))
print('---')

products = all_divs[3].find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print('name:', name.text)
print('rating:', rating.text)
print('price:', price.text)

EDIT:

Version which displays all products:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0',
}    

url = "https://www.amazon.es/s?k=docking station&__mk_es_ES=ÅMÅŽÕÑ&crid=34FO3BVVCJS4V&sprefix=docking,aps,302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)

#print(response.text[:1000])
#print('---')

soup = BeautifulSoup(response.text, "html.parser")

results = soup.find("div", {"class": "s-main-slot s-result-list s-search-results sg-row"})

all_products = results.find_all("div", {"class": "sg-col-inner"})
print('len(all_products):', len(all_products))
print('---')

for item in all_products:
    name = item.find("span", {"class": "a-size-base-plus"})
    rating = item.find("span", {"class": "a-icon-alt"})
    price = item.find("span", {"class": "a-price-whole"})
    if name:
        print('name:', name.text)
    if rating:
        print('rating:', rating.text)
    if price:
        print('price:', price.text)
    if name or rating or price:
        print('---')

BTW:

From time to time portals refresh the code and HTML on their servers, so if you find a tutorial, check how old it is. Older tutorials may not work because the portal may have changed something since.

Many modern pages use JavaScript to add elements, but requests and BeautifulSoup can't run JavaScript. In that case you may need Selenium to control a real web browser, which can run JavaScript.
