I have searched but didn't find a solution to my question. These days I'm working on web scraping with BeautifulSoup in Python, and I've run into a problem while using it. To scrape faster I tried running my code in multiple instances, but when I ran more than 3 programs at the same time the web page blocked me for a while (I run the programs separately as different scripts). Because of this, is there any technique to find out what a website's request limit is for the same IP or user, etc.? If there is no way to do that, how can I find an optimum request rate to use for a website? Thank you in advance.
CodePudding user response:
There is no easy way to find a hard limit for how many requests per second a website can accept. This is up to the server configuration and the hosting provider. However, when doing web scraping it's very important to be cognizant of the load you are imposing on the website's servers.
I would recommend the following:
- Always respect a website's robots.txt file -- you can usually find it at the root of the website.
- If you're set on using the requests library, make sure to add enough time.sleep() calls to pace your requests and avoid overloading the server (see the sketch after this list).
- A better solution would involve using Selenium and its suite of waits, which makes it easier to browse a website the way a normal person would.
The bottom line is to not abuse the websites you are scraping, but to do it in a polite way, like a human would. This ensures that the websites will still be up next time you want to scrape them.
CodePudding user response:
High-volume APIs like Reddit, Twitter, Facebook, etc. return rate-limit response headers:
X-Ratelimit-Used: approximate number of requests used in this period
X-Ratelimit-Remaining: approximate number of requests left to use
X-Ratelimit-Reset: approximate number of seconds until the end of the period
https://developer.twitter.com/en/docs/twitter-api/rate-limits
You can read the response headers via requests -- r = requests.get(...); r.headers -- before you pass r.text into BeautifulSoup, as in the sketch below.
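A minimal sketch of checking those headers before parsing; the URL and User-Agent are just examples, and the X-Ratelimit-* headers are only sent by some APIs, so the code falls back gracefully when they are absent:

```python
import time

import requests
from bs4 import BeautifulSoup

URL = "https://www.reddit.com/r/Python/"  # example URL, not from the original question

r = requests.get(URL, headers={"User-Agent": "my-scraper/0.1"}, timeout=10)

# These headers are only set by some high-volume APIs; handle the case where they're missing.
used = r.headers.get("X-Ratelimit-Used")
remaining = r.headers.get("X-Ratelimit-Remaining")
reset = r.headers.get("X-Ratelimit-Reset")
print(f"used={used} remaining={remaining} reset={reset}s")

# If we're close to the limit, wait until the window resets before making more requests.
if remaining is not None and float(remaining) < 1:
    time.sleep(float(reset))

soup = BeautifulSoup(r.text, "html.parser")
print(soup.title.string if soup.title else "no <title>")
```

For sites that don't expose these headers, a 429 (Too Many Requests) status code is often the only signal that you've hit the limit, so checking r.status_code and backing off is the practical fallback.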