I am trying to scrape ETFs from the website https://www.etf.com/channels. However, no matter what I try, it returns a 503 error. I've tried different user agents as well as headers, but it still won't let me access the page. Sometimes when I open the website in a browser, a page pops up that "checks if the connection is secure", so I assume they have measures in place to stop scraping. I've seen others ask the same question, and the answer always says to add a user agent, but that didn't work for this site.
Scrapy
import scrapy


class BrandETFs(scrapy.Spider):
    name = "etfs"
    start_urls = ['https://www.etf.com/channels']

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
        "Host": "www.etf.com",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0"
    }
    custom_settings = {'DOWNLOAD_DELAY': 0.3, "CONCURRENT_REQUESTS": 4}

    def start_requests(self):
        url = self.start_urls[0]
        # Pass the custom headers explicitly; otherwise Scrapy sends its defaults
        yield scrapy.Request(url=url, headers=self.headers)

    def parse(self, response):
        test = response.css('div.discovery-slat')
        yield {
            "test": test
        }
Requests
import requests

url = 'https://www.etf.com/channels'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Referer': 'https://google.com',
    'Origin': 'https://www.etf.com'
}

# Use GET: loading a page in a browser sends a GET request, not a POST
r = requests.get(url, headers=headers)
r.raise_for_status()
Is there any way to get around these blocks and access the website?
CodePudding user response:
Status 503 - Service Unavailable is often seen in such cases, so you are probably right with your assumption that they have taken measures against scraping.
For the sake of completeness, they prohibit what you are attempting in their Terms of Service (No. 7g):
[...] You agree that you will not [...] Use automated means, including spiders, robots, crawlers [...]
Technical point of view
The User-Agent in the header is just one of many things you should consider when you try to hide the fact that your requests are automated.
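For illustration only, here is a minimal requests.Session sketch that sends a fuller, browser-like header set and keeps cookies between requests; the header values are examples and there is no guarantee they get past the site's checks:

import requests

# A session stores any cookies the server sets, which is one more signal a
# real browser would produce; the header values below are illustrative only.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
})

# Load the landing page first so cookies are stored, then the target page
session.get("https://www.etf.com/")
r = session.get("https://www.etf.com/channels")
print(r.status_code)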
Since you see a page that seems to verify that you are still/again a human, it is likely that they have figured out what is going on and
have an eye on your IP. It might not be blacklisted (yet) because they notice changes whenever you try to access the page.
How did they find out? Based on your question and code, I guess it is simply that your IP did not change, in combination with:
- Request rate: You have sent (too many) requests too quickly, i.e. faster than they would expect a human to.
- Periodic requests: Static delays between requests, so they see pretty regular timing on their side (a sketch of less regular settings follows this list).
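As a rough sketch of what slower, less regular timing could look like in the Scrapy spider from the question (the values are guesses, not a recipe that is known to pass the site's checks):

custom_settings = {
    "DOWNLOAD_DELAY": 5,               # much slower base delay than the original 0.3s
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # Scrapy multiplies the delay by a random factor between 0.5 and 1.5
    "CONCURRENT_REQUESTS": 1,          # one request at a time, like a single browser tab
    "AUTOTHROTTLE_ENABLED": True,      # adapt the delay to the server's response times
}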
There are several other aspects that might or might not be monitored. However, using proxies (i.e. changing IP addresses) would be a step in the right direction.
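With requests, proxies can be passed per request; the proxy URL below is a placeholder for whatever proxy pool or rotating-proxy service you would actually use:

import requests

# Placeholder proxy endpoint - substitute a real proxy or rotation service
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}
r = requests.get("https://www.etf.com/channels", proxies=proxies, timeout=30)
print(r.status_code)

In Scrapy, the equivalent per-request option is request.meta["proxy"], or a rotating-proxy middleware that sets it for you.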