I want to make a web scraper for Amazon.
But it looks like every element I search for is of None type.
I searched on Google and there are many people who have made Amazon web scrapers.
Please give me some advice to solve this NoneType issue.
Here is my code:
import requests
from bs4 import BeautifulSoup
amazon_dir = requests.get("https://www.amazon.es/s?k=docking station&__mk_es_ES=ÅMÅŽÕÑ&crid=34FO3BVVCJS4V&sprefix=docking,aps,302&ref=nb_sb_ss_ts-doa-p_1_7")
amazon_soup = BeautifulSoup(amazon_dir.text, "html.parser")
product_table = amazon_soup.find("div", {"class": "sg-col-inner"})
print(product_table)
products = product_table.find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print(name, rating, price)
Thank you
CodePudding user response:
Portals may check the User-Agent header to send different HTML to different browsers or devices, and sometimes this makes it a problem to find elements on the page. But usually portals check this header to block scripts/bots. For example, requests sends User-Agent: python-requests/2.26.0. If I use a User-Agent header from a real browser, or at least the shorter version Mozilla/5.0, then the code works.
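You can check what requests sends by default with requests.utils.default_headers(), and override it per request by passing headers= — a minimal sketch (the URL is just a placeholder):

```python
import requests

# Inspect the default headers that requests attaches to every request.
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)  # something like 'python-requests/2.x.x'

# Overriding it is just a matter of passing headers= to requests.get():
headers = {'User-Agent': 'Mozilla/5.0'}
# requests.get("https://example.com", headers=headers)  # server now sees 'Mozilla/5.0'
```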
There is another problem. There are almost 70 <div class="sg-col-inner"> elements on the page, and the one with the products is not the first one, but find() gives only the first element. You have to use find_all() and then use an index (in the code below, [3]) to get the right one.
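The difference between find() and find_all() is easy to see on a small standalone snippet (made-up HTML, just for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="sg-col-inner">first</div>
<div class="sg-col-inner">second</div>
<div class="sg-col-inner">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching element
print(soup.find("div", {"class": "sg-col-inner"}).text)   # first

# find_all() returns a list of all matches, which you can index
all_divs = soup.find_all("div", {"class": "sg-col-inner"})
print(len(all_divs))       # 3
print(all_divs[2].text)    # third
```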
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0',
}
url = "https://www.amazon.es/s?k=docking station&__mk_es_ES=ÅMÅŽÕÑ&crid=34FO3BVVCJS4V&sprefix=docking,aps,302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)
print(response.text[:1000])
print('---')
amazon_soup = BeautifulSoup(response.text, "html.parser")
all_divs = amazon_soup.find_all("div", {"class": "sg-col-inner"})
print('len(all_divs):', len(all_divs))
print('---')
products = all_divs[3].find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print('name:', name.text)
print('rating:', rating.text)
print('price:', price.text)
EDIT:
Version which displays all products:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0',
}
url = "https://www.amazon.es/s?k=docking station&__mk_es_ES=ÅMÅŽÕÑ&crid=34FO3BVVCJS4V&sprefix=docking,aps,302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)
#print(response.text[:1000])
#print('---')
soup = BeautifulSoup(response.text, "html.parser")
results = soup.find("div", {"class": "s-main-slot s-result-list s-search-results sg-row"})
all_products = results.find_all("div", {"class": "sg-col-inner"})
print('len(all_products):', len(all_products))
print('---')
for item in all_products:
    name = item.find("span", {"class": "a-size-base-plus"})
    rating = item.find("span", {"class": "a-icon-alt"})
    price = item.find("span", {"class": "a-price-whole"})
    if name:
        print('name:', name.text)
    if rating:
        print('rating:', rating.text)
    if price:
        print('price:', price.text)
    if name or rating or price:
        print('---')
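As a side note, the same lookups can be written with CSS selectors via select() / select_one() — a sketch on a small made-up snippet (the product name and price are invented):

```python
from bs4 import BeautifulSoup

html = """
<div class="s-main-slot s-result-list">
  <div class="sg-col-inner">
    <span class="a-size-base-plus">Docking Station X</span>
    <span class="a-price-whole">39</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of all matches
for item in soup.select("div.s-main-slot div.sg-col-inner"):
    name = item.select_one("span.a-size-base-plus")   # first match or None
    price = item.select_one("span.a-price-whole")
    if name and price:
        print(name.text, '-', price.text)
```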
BTW:
From time to time portals refresh the code and HTML on their servers, so if you find a tutorial, check how old it is. Older tutorials may not work because the portal could have changed something since then.
Many modern pages use JavaScript to add elements, but requests and BeautifulSoup can't run JavaScript. This may require using Selenium to control a real web browser, which can run JavaScript.