Context:
- I have a web scraper running in a hosted web app on Heroku.
- It scrapes the Google search page to get some required information.
- I am using the requests package.
This is my code:
import requests

# this is a method inside a class
def get_weather_component(self):
    # Send the request and store the webpage that comes back as the response
    s = requests.Session()
    s.headers["User-Agent"] = self.USER_AGENT
    s.headers["Accept-Language"] = self.LANGUAGE
    s.headers["Content-Language"] = self.LANGUAGE
    html = s.get(self.url)
Note: I know that I can check the status code to see whether the response is a 429 (Too Many Requests) error, as in the sketch below.
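A minimal sketch of that check, written as it would appear inside get_weather_component; the 60-second fallback wait is an assumption, since Google may or may not include a Retry-After header:

import time

# inside get_weather_component, after setting the headers
resp = s.get(self.url)
if resp.status_code == 429:
    # honour Retry-After if present, otherwise fall back to an assumed 60 s
    wait = int(resp.headers.get("Retry-After", 60))
    time.sleep(wait)
    resp = s.get(self.url)  # retry once after the pause
html = resp.text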
Issues:
- Can there be any other possible reason for the request being blocked that needs to be handled?
- What is the minimum time gap between requests that Google requires?
Any suggestions gratefully received. Thanks in advance.
CodePudding user response:
There is an API for Google Search. Google has probably placed a limit on the number of requests coming from the same IP.
- Slow down until you figure out the limit (add a time.sleep() call or something like that; see the sketch after this list).
- Run it on several servers so the requests appear to come from different IP addresses (for example, deploy the application in a containerized environment).
- Stop trying to crawl Google search results directly and use their RESTful API instead.
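A minimal sketch of the slow-down option, using a hypothetical helper fetch_with_backoff; the base delay of 5 seconds and the doubling factor are assumptions, since Google does not publish its rate limits and the right values have to be found by experiment:

import time
import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"  # assumed browser-like User-Agent

def fetch_with_backoff(url, max_retries=5, base_delay=5):
    # Retry a GET with an exponentially growing pause whenever Google answers 429.
    delay = base_delay
    for attempt in range(max_retries):
        resp = session.get(url)
        if resp.status_code != 429:
            return resp
        time.sleep(delay)   # back off before the next attempt
        delay *= 2          # double the pause each time we are rate-limited
    raise RuntimeError(f"still rate-limited after {max_retries} retries")

If the scraper still gets blocked, Google's Custom Search JSON API is the supported way to query search results programmatically.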
CodePudding user response:
import requests

# Accept-Language and Content-Language are request headers, so they belong in
# the headers dict rather than in the body of a GET request
headers = {
    "Accept-Language": self.LANGUAGE,
    "Content-Language": self.LANGUAGE,
}
r = requests.get(self.url, headers=headers)
print(r.status_code)