When I scrape Amazon and Google Shopping with BeautifulSoup, after a while (about 100 to 200 products analyzed) the sites identify me as a robot. How do I prevent this from happening?
Changing my IP lets me start again, but after a while they block me again.
Here is my code:
from bs4 import BeautifulSoup
import requests
cookies_goo = {
    "NID": "511=ktkACo_ZFBfZiD_DvYTKQFmYYX7R3Esh1ZtJ6A3F87KG_YzkbqlHc0NmQsGPyc78KIOXyCtVuYE9QmX-ixl-HzpbE9N9K67sGQCTZ2CFZ1oZAhe-iSFKtCcsUCsY8CHmbDu9YtxaEs7prgZqRID19DI6bqN2lxQZjog8HY6ur_M",
    "1P_JAR": "2021-11-05-13",
    "CONSENT": "YES cb.20211102-08-p0.it FX 548"
}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36",
    "Accept-Language": "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7"
}
response = requests.get(url, headers=header, cookies=cookies_goo)
soup = BeautifulSoup(response.content, "lxml")
CodePudding user response:
You are a robot, so their algorithm is entirely correct. Try using their API instead.
CodePudding user response:
- Rotating proxies
- Delays
- Avoid the same pattern
- IP rate limit (probably your issue)
IP rate limiting is a basic security mechanism that blocks or bans requests from the same IP when they arrive too quickly. A regular user would not make 100 requests to the same domain in a few seconds with the exact same pattern (for example: scroll, click, scroll, click, open).
See also: How to reduce the chance of being blocked while web scraping search engines.
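The first three points in the list above (rotating proxies, delays, avoiding the same pattern) could be sketched roughly like this. It is only a minimal sketch under assumptions, not a guaranteed way around blocking: `polite_get`, `backoff_delay`, `USER_AGENTS` and `PROXIES` are hypothetical names, the pause lengths are guesses, and it assumes the site signals throttling with HTTP 429 or 503. The request function is passed in as an argument, so it should work with `requests.get` or the `get` method of a `requests.Session`.

```python
import itertools
import random
import time

# Hypothetical pool of User-Agent strings (examples only); use current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0",
]

# Hypothetical proxy pool, e.g. "http://user:pass@host:port"; None means "no proxy".
PROXIES = [None]
_proxy_cycle = itertools.cycle(PROXIES)


def backoff_delay(attempt, base=2.0, cap=120.0):
    """Exponential backoff with jitter: a random wait in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def polite_get(get, url, headers=None, max_retries=4, pause=(2.0, 6.0),
               sleep=time.sleep, **kwargs):
    """Call get(url, ...) with an irregular pause, a rotated User-Agent and proxy,
    and a growing wait whenever the response status is 429 or 503."""
    response = None
    for attempt in range(max_retries + 1):
        sleep(random.uniform(*pause))  # irregular gap so requests are not evenly spaced

        request_headers = dict(headers or {})
        request_headers["User-Agent"] = random.choice(USER_AGENTS)  # replaces any fixed UA

        proxy = next(_proxy_cycle)  # rotate through the proxy pool
        proxies = {"http": proxy, "https": proxy} if proxy else None

        response = get(url, headers=request_headers, proxies=proxies,
                       timeout=30, **kwargs)
        if response.status_code not in (429, 503):
            return response
        if attempt < max_retries:
            sleep(backoff_delay(attempt))  # throttled: wait longer before the next try
    return response  # still throttled after all retries; the caller decides what to do
```

With the code from the question this would be called as `response = polite_get(requests.get, url, headers=header, cookies=cookies_goo)`. Sites like Google and Amazon look at more signals than these, so this can lower the chance of being blocked but will not remove it.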
Alternatively, you can use the Google Shopping Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time figuring out how to bypass blocks from Google, since that is already handled for the end user.
Example code to parse data from Google Shopping:
import os
from serpapi import GoogleSearch
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_product",
    "product_id": "14506091995175728218",  # can be iterated over multiple product ids
    "gl": "us",  # country to search from
    "hl": "en"   # language
}
search = GoogleSearch(params)
results = search.get_dict()
title = results['product_results']['title']
prices = results['product_results']['prices']
reviews = results['product_results']['reviews']
rating = results['product_results']['rating']
extensions = results['product_results']['extensions']
description = results['product_results']['description']
user_reviews = results['product_results']['reviews']
reviews_results = results['reviews_results']['ratings']
print(f'{title}\n'
      f'{prices}\n'
      f'{reviews}\n'
      f'{rating}\n'
      f'{extensions}\n'
      f'{description}\n'
      f'{user_reviews}\n'
      f'{reviews_results}')
'''
Google Pixel 4 White 64 GB, Unlocked
['$247.79', '$245.00', '$439.00']
526
3.7
['October 2019', 'Google', 'Pixel Family', 'Pixel 4', 'Android', '5.7″', 'Facial Recognition', '8 MP front camera', 'Smartphone', 'With Wireless Charging']
Point and shoot for the perfect photo. Capture brilliant color and control the exposure balance of different parts of your photos. Get the shot without the flash. Night Sight is now faster and easier to use it can even take photos of the Milky Way. Get more done with your voice. The new Google Assistant is the easiest way to send texts, share photos, and more. A new way to control your phone. Quick Gestures let you skip songs and silence calls – just by waving your hand above the screen. End the robocalls. With Call Screen, the Google Assistant helps you proactively filter our spam before your phone ever rings.
526
[{'stars': 1, 'amount': 101}, {'stars': 2, 'amount': 43}, {'stars': 3, 'amount': 39}, {'stars': 4, 'amount': 73}, {'stars': 5, 'amount': 270}]
'''
Example of iterating over multiple product IDs:
# import os
# from serpapi import GoogleSearch
# random numbers except the first one
products = ['14506091995175728218', '1450609199517512118', '145129895175728218']
for product in products:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_product",
        "product_id": product,
        "gl": "us",
        "hl": "en"
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    # .get() avoids a KeyError if an ID returns no 'product_results'
    title = results.get('product_results', {}).get('title')
    print(title)  # one title per product ID (None if nothing was returned)
Disclaimer: I work for SerpApi.