How to programmatically detect if Google is blocking me from making any further requests?

Context:

I have a web scraper running in a web app hosted on Heroku.
It scrapes the Google search page to get some required information.
I am using the requests package.

This is my code:

    import requests

    # this is a method inside a class
    def get_weather_component(self):
        # Send the request and store the webpage that comes back as the response
        s = requests.Session()
        s.headers["User-Agent"] = self.USER_AGENT
        s.headers["Accept-Language"] = self.LANGUAGE
        s.headers["Content-Language"] = self.LANGUAGE
        html = s.get(self.url)

Note: I know that I can check whether the status code is 429 (Too Many Requests).

Issues:

Can the request be blocked for any other reason that I need to handle?
What is the minimum time gap between requests that Google requires?

Any suggestions gratefully received. Thanks in advance.

CodePudding user response:

There is an API for Google Search. Google has probably placed a limit on the number of requests coming from the same IP address. You have a few options:

Slow down until you figure out the limit (add a time.sleep() call or something similar between requests).
Run it on several servers so that the requests appear to come from different IP addresses (for example, deploy your application in a containerized environment).
Stop crawling Google directly for search data and use their RESTful API instead.
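
The "slow down" option could look roughly like the sketch below. It is only an illustration: the function name fetch_with_backoff and all of its parameters are made up for this example, and the delays are guesses, since Google does not publish a minimum gap between requests. It retries on 429 and 503, uses the Retry-After header when that header is a plain number of seconds, and otherwise doubles the wait on every attempt.

```python
import random
import time


def fetch_with_backoff(get, url, max_tries=5, base_delay=2.0, sleep=time.sleep):
    """Call get(url) until the response is not a 429/503, waiting longer each time.

    `get` is any callable that returns an object with .status_code and .headers,
    for example requests.Session().get. All names here are hypothetical.
    """
    response = None
    for attempt in range(max_tries):
        response = get(url)
        if response.status_code not in (429, 503):
            return response
        if attempt == max_tries - 1:
            break  # out of tries, so do not wait again
        # Prefer the server's Retry-After hint when it is a number of seconds.
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            # Exponential backoff with jitter: base, 2*base, 4*base, ... plus up to 1s.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
        sleep(delay)
    # Still throttled after max_tries attempts; the caller decides what to do.
    return response
```

With the session from the question this would be called as fetch_with_backoff(s.get, self.url). The sleep parameter exists only so the waiting can be replaced in tests.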

CodePudding user response:

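    import requests

    # Runs inside the class method from the question, where self.USER_AGENT,
    # self.LANGUAGE and self.url are defined.
    # Accept-Language is a request header, so it belongs in headers, not in the body.
    headers = {
        "User-Agent": self.USER_AGENT,
        "Accept-Language": self.LANGUAGE,
    }

    r = requests.get(self.url, headers=headers)
    # 200 means the request went through; 429 means too many requests
    print(r.status_code)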
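
The status code alone may not cover every case, so a slightly broader check is sketched below. Treat it as a heuristic built on assumptions rather than documented behaviour: it assumes that a blocked request either returns 429 or 503, is redirected to a URL containing "/sorry/", or gets a page mentioning "unusual traffic". The function name looks_blocked is made up for this example.

```python
def looks_blocked(status_code, final_url, body):
    """Heuristic check for a rate-limit or CAPTCHA response. Not exhaustive."""
    # Explicit throttling or temporary-unavailable status codes.
    if status_code in (429, 503):
        return True
    # Assumption: blocked requests end up on a URL containing "/sorry/".
    if "/sorry/" in final_url:
        return True
    # Assumption: the interstitial page mentions "unusual traffic".
    return "unusual traffic" in body.lower()
```

With the response object from above, the call would be looks_blocked(r.status_code, r.url, r.text), where r.url is the final URL after any redirects.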