How to programmatically detect if Google is blocking me from making any further requests?

Time:02-22

Context:

  • I have a web scraper running in a web app hosted on Heroku.
  • It scrapes the Google search page to get some required information.
  • I am using the requests package.

This is my code:

import requests

# This is a method inside a class
def get_weather_component(self):
    # Send the request and store the response (a requests.Response object)
    s = requests.Session()
    s.headers["User-Agent"] = self.USER_AGENT
    s.headers["Accept-Language"] = self.LANGUAGE
    s.headers["Content-Language"] = self.LANGUAGE
    html = s.get(self.url)

Note: I know that I can check the status code to see if it is HTTP 429 (Too Many Requests).

Issues:

  • Can there be any other reason for a request being blocked that I need to handle?
  • What is the minimum time gap that Google requires between requests?

Any suggestions gratefully received. Thanks in advance.

CodePudding user response:

There is an API for Google Search. Google has probably placed a limit on the number of requests coming from the same IP address.

  • Slow down until you figure out the limit (add a time.sleep() call or something similar between requests).
  • Run it on several servers so that requests appear to come from different IP addresses (for example, deploy your application in a containerized environment).
  • Stop crawling Google directly for search data and use their RESTful API instead.
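
The "slow down" suggestion can be sketched as a retry loop with increasing waits. This is only an illustration: the names backoff_delays and fetch_with_backoff are hypothetical, the delay values are arbitrary assumptions, and none of the numbers is a known Google limit.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, attempts=5):
    """Yield increasing wait times in seconds, capped at max_delay."""
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor


def fetch_with_backoff(fetch, is_blocked, sleep=time.sleep, **kwargs):
    """Call fetch() again after each wait while is_blocked(response) is true.

    fetch and is_blocked are caller-supplied callables (assumptions here);
    the last response is returned even if it still looks blocked.
    """
    response = fetch()
    for delay in backoff_delays(**kwargs):
        if not is_blocked(response):
            break
        # Add a little random jitter so retries are not perfectly periodic.
        sleep(delay + random.uniform(0, delay * 0.1))
        response = fetch()
    return response
```

With the requests package this could be called as fetch_with_backoff(lambda: s.get(url), lambda r: r.status_code == 429), where s is the session from the question.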

CodePudding user response:

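import requests

# Runs inside the class method from the question, where self is available.
# Language preferences belong in the request headers, not in the body of a GET.
headers = {
    "Accept-Language": self.LANGUAGE,
    "Content-Language": self.LANGUAGE,
}

r = requests.get(self.url, headers=headers)
print(r.status_code)  # 429 means too many requests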
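
The status code alone may not cover every kind of block. Below is a minimal heuristic sketch, assuming a block can also appear as an HTTP 503, a redirect to a URL containing "/sorry/", or a page mentioning "unusual traffic" or a CAPTCHA. These markers come from commonly reported behaviour, not from any documented Google contract, and the function name looks_blocked is hypothetical.

```python
def looks_blocked(status_code, final_url, body):
    """Heuristically decide whether a response looks throttled or challenged."""
    # Explicit rate limiting or temporary refusal.
    if status_code in (429, 503):
        return True
    # Assumed marker: redirect to an interstitial "sorry" page.
    if "/sorry/" in final_url:
        return True
    # Assumed markers: challenge text in the returned HTML.
    markers = ("unusual traffic", "captcha")
    text = body.lower()
    return any(marker in text for marker in markers)
```

With the response r from the code above, this would be called as looks_blocked(r.status_code, r.url, r.text). Because it only inspects plain values, it can be checked without making any network request.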