Scrapy crawler

Time:10-15

I'm scraping a small amount of data. The page is fetched like this: the original URL returns a 302 to a verification URL, and once verification finishes, another 302 leads back to the page that displays the content.
With the Scrapy framework, when I use a proxy IP the request gets 302-redirected to other sites and I can't get any data. Without a proxy the data crawls normally, but after a while I get blocked. The crawler's User-Agent is random, the other request headers are copied from the original page, and the proxy IPs have been tested. I've worked on this for several days without success. Could the experts here please tell me what the problem is and how to solve it?

These are all the points I have to offer; experts, please help.

CodePudding user response:

1. "With the Scrapy framework, using a proxy IP gets 302-redirected to other sites and no data comes back; without a proxy the data crawls normally":
- Most likely the proxies you are using are low quality, so the backend identifies them and restricts the data. Try switching to a batch of higher-quality (paid) proxies.

2. "Without a proxy the data crawls normally, but after a while I get blocked; the crawler's User-Agent is random and the other request headers are copied from the original page":
- Try reducing the crawl frequency. Switching to a random User-Agent too often is also easy for anti-crawling checks to flag. I suggest increasing the sleep time: start from a long delay and work downward to find the server's detection threshold.
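
For point 2, a minimal sketch of what "reduce the crawl frequency" could look like in a Scrapy settings.py. The setting names are standard Scrapy settings; the values are only assumed starting points to tune against the site, not known thresholds.

```python
# settings.py (sketch): slow the crawl down. All values are assumptions to tune.

# Base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 3

# Wait a random 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval.
RANDOMIZE_DOWNLOAD_DELAY = True

# Only one request in flight per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let the AutoThrottle extension adjust the delay based on response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 30
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

Starting with a large delay and lowering it step by step until blocks reappear is one way to locate the detection threshold mentioned above.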

Other: check that the relevant items in the settings file have been changed accordingly.
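
For point 1, here is a rough sketch of a downloader middleware that notices when a 3xx response points at a different host and retries the request through another proxy. It is only an illustration under assumptions: the class name, `PROXY_POOL`, `MAX_PROXY_SWAPS`, and the `proxy_swaps` meta key are all made up for this example, and the example.com proxies are placeholders. `request.meta["proxy"]` is the key Scrapy's built-in HttpProxyMiddleware reads. Comparing hostnames is a simplification: if the site's verification URL lives on a different host, that host would need to be allowed explicitly.

```python
from urllib.parse import urljoin, urlparse

REDIRECT_CODES = {301, 302, 303, 307, 308}

# Hypothetical proxy pool; replace with proxies you have verified yourself.
PROXY_POOL = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]
MAX_PROXY_SWAPS = 3  # assumed cap so a bad pool cannot loop forever


def is_offsite_redirect(request_url, location):
    """Return True if the Location header points at a different host."""
    target = urljoin(request_url, location)  # also resolves relative URLs
    return urlparse(target).hostname != urlparse(request_url).hostname


class OffsiteRedirectProxyMiddleware:
    """Sketch: swap the proxy when a redirect leaves the original site."""

    def process_response(self, request, response, spider):
        if response.status not in REDIRECT_CODES:
            return response
        location = response.headers.get("Location")
        if not location:
            return response
        if isinstance(location, bytes):  # Scrapy header values are bytes
            location = location.decode("latin-1")
        if not is_offsite_redirect(request.url, location):
            return response  # same-host redirect: leave it to Scrapy
        swaps = request.meta.get("proxy_swaps", 0)
        if swaps >= MAX_PROXY_SWAPS:
            return response  # give up; normal redirect handling applies
        # Re-issue the same request through the next proxy in the pool.
        retry = request.replace(dont_filter=True)
        retry.meta["proxy_swaps"] = swaps + 1
        retry.meta["proxy"] = PROXY_POOL[swaps % len(PROXY_POOL)]
        return retry
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in the settings file with an order above 600 (for example 650, with a module path of your own project). As far as I know, process_response hooks run in decreasing order and the built-in RedirectMiddleware sits at 600, so a higher number lets this middleware see the raw 302 first.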

CodePudding user response:

In reply to the 1st-floor response by weixin_41768513:

The program uses scrapy_redis. The retry and redirect settings I configured appear to have no effect, while another plain Scrapy script can reach the data after following the two redirects.
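
For comparison, this is roughly what the retry and redirect settings look like in a scrapy_redis project. It is a sketch, not the poster's actual configuration: the setting names are real Scrapy and scrapy_redis settings, but the Redis URL and all numeric values are assumptions.

```python
# settings.py (sketch) for a scrapy_redis project. Values are assumptions.

# scrapy_redis scheduler and duplicate filter backed by Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379/0"  # assumed local Redis instance

# Follow redirects; the flow in this thread needs at least two hops.
REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 5

# Retry failed responses a few times.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Keep cookies so the verification step can carry over to the next request.
COOKIES_ENABLED = True
```

One thing that may be worth ruling out, although the thread does not confirm it: when the second 302 leads back to the original URL, the duplicate filter can drop that request unless the original request was sent with dont_filter=True, since a redirected request inherits that flag from the request that triggered it.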