I've created a script using scrapy that routes its requests through proxies to fetch content from a website. The script appears to be working correctly. The site I'm trying to grab data from is https://www.zillow.com/miami-fl-33166/.
Since this is an https site and I'm using https proxies, I've set up the proxy like this:
request.meta['proxy'] = 'https://123.200.20.242:58847'
However, when I ran the script today after accidentally changing https to http, as shown below, I noticed that the script still works.
request.meta['proxy'] = 'http://123.200.20.242:58847'
This is how I've implemented proxies within middleware:
def process_request(self, request, spider):
    request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    request.meta['proxy'] = 'https://123.200.20.242:58847'
    # request.meta['proxy'] = 'http://123.200.20.242:58847'
And this is how the middleware is registered in the settings:
DOWNLOADER_MIDDLEWARES = {
    'customized_bot.proxy_middleware.ProxiesMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
What is the right way to set up https proxies within meta?
CodePudding user response:
Using an https proxy is no different from using an http proxy. You simply change the scheme of the proxy address from http to https. See this article on zyte.com on how to use an https proxy. To summarize, you can:
- Pass the proxy via the meta object when making a scrapy.Request
- Set up a custom scrapy middleware that adds the proxy to each scrapy Request. More details are provided at zyte.com
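As a rough illustration of the first option, here is a minimal sketch of a helper that builds the meta dict for a request. The helper name proxy_meta is hypothetical (it is not part of scrapy), the proxy address is the one from the question, and the check only verifies that the address has an http or https scheme and a host; it does not test whether the proxy is reachable.

```python
from urllib.parse import urlsplit


def proxy_meta(proxy_url):
    """Return a meta dict carrying proxy_url under the 'proxy' key.

    Hypothetical helper: it only checks the shape of the address
    (http/https scheme plus a host), nothing about the proxy itself.
    """
    parts = urlsplit(proxy_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) proxy address: {proxy_url!r}")
    return {"proxy": proxy_url}


# Both schemes from the question produce the same kind of meta dict.
http_meta = proxy_meta("http://123.200.20.242:58847")
https_meta = proxy_meta("https://123.200.20.242:58847")

# Inside a spider (requires scrapy), the dict would be passed like this:
#   yield scrapy.Request(url, meta=http_meta, callback=self.parse)
```

Keeping the address in one helper means a missing scheme is caught before the request is scheduled.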
To answer your question: http and https proxy addresses can be used interchangeably to scrape both http and https URLs.
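For the middleware route, a minimal sketch might look like the following. The class name DefaultProxyMiddleware and the setting name DEFAULT_PROXY are assumptions made for illustration, not scrapy built-ins. The middleware applies a default proxy only when a request does not already carry one in its meta. The last lines use a stand-in object instead of a real scrapy Request so the sketch runs without scrapy installed.

```python
from types import SimpleNamespace


class DefaultProxyMiddleware:
    """Hypothetical downloader middleware that sets a default proxy."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # DEFAULT_PROXY is an assumed custom setting name, not a scrapy built-in.
        return cls(crawler.settings.get("DEFAULT_PROXY"))

    def process_request(self, request, spider):
        # Keep a proxy that was set on the request itself; otherwise use the default.
        request.meta.setdefault("proxy", self.proxy_url)
        # Returning None tells scrapy to continue processing this request.
        return None


# Stand-in for a scrapy Request, used only to exercise the method here.
fake_request = SimpleNamespace(meta={})
middleware = DefaultProxyMiddleware("http://123.200.20.242:58847")
middleware.process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

Using setdefault rather than plain assignment lets a proxy passed through a request's meta take precedence over the middleware's default.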