Unable to scrape Etherscan transaction URLs - Cloudflare protection

Code to scrape the block ID of an Etherscan transaction:

def block_chain_explorer_block_id(url):
    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tags = soup.findAll('div', attrs = {'class':'col-md-9'}) 
    print(soup.findAll('a'))

block_chain_explorer_block_id('https://etherscan.io/tx/0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54')

Output I am getting:

[<a href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" rel="noopener noreferrer" target="_blank">Cloudflare</a>]

I am getting the above output for etherscan, while polygonscan works fine. Any idea how to make it work?

CodePudding user response：

Etherscan has an API (with a Free plan).

You should use it instead of trying to scrape the site. Here are the docs for Transactions: https://docs.etherscan.io/api-endpoints/stats
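A minimal sketch of the API route, assuming the `proxy` module's `eth_getTransactionByHash` action on `https://api.etherscan.io/api` and a placeholder key `YourApiKeyToken`. The base URL and required parameters may differ between API versions, so check the current docs. The helper names are hypothetical. The response is expected to be a JSON-RPC style payload whose `blockNumber` field is a hex string.

```python
# Assumed base URL; verify it against the current Etherscan docs.
API_URL = 'https://api.etherscan.io/api'


def build_tx_params(tx_hash, api_key):
    """Query parameters for looking up a transaction by its hash."""
    return {
        'module': 'proxy',
        'action': 'eth_getTransactionByHash',
        'txhash': tx_hash,
        'apikey': api_key,
    }


def parse_block_number(payload):
    """Return the block number as an int, or None if it is not available."""
    result = payload.get('result')
    # Error responses carry a string here; pending transactions have no block yet.
    if not isinstance(result, dict) or not result.get('blockNumber'):
        return None
    return int(result['blockNumber'], 16)


def fetch_block_number(tx_hash, api_key):
    # Imported here so the parsing helper works without requests installed.
    import requests

    r = requests.get(API_URL, params=build_tx_params(tx_hash, api_key), timeout=10)
    r.raise_for_status()
    return parse_block_number(r.json())
```

Usage would look like `fetch_block_number('0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54', 'YourApiKeyToken')`, with the placeholder replaced by your own key.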

CodePudding user response：

Adding some headers to the request, so that it looks like it comes from a browser, can provide momentary relief, but it is far from bulletproof.

You should also consider how often and how quickly you request the target pages.
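One way to slow things down is a fixed pause between requests. This is only a sketch: the helper name `fetch_politely` and the 2 second default are assumptions, not values known to satisfy Cloudflare.

```python
import time


def fetch_politely(urls, fetch, min_delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing min_delay seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No pause is needed before the very first request.
            sleep(min_delay)
        results.append(fetch(url))
    return results
```

With requests this could be called as `fetch_politely(urls, lambda u: requests.get(u, headers=headers))`.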

Use of rotating proxies is also a common approach.
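As a rough illustration of rotation, the sketch below cycles through a list of proxies and produces dicts in the shape that the `proxies` argument of `requests.get()` accepts. The proxy addresses and the helper name `proxy_rotation` are hypothetical placeholders.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace with proxies you actually control.
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]


def proxy_rotation(proxy_urls):
    """Yield requests-style proxies dicts, cycling through proxy_urls endlessly."""
    for proxy in cycle(proxy_urls):
        yield {'http': proxy, 'https': proxy}
```

Each request then takes the next entry, for example `requests.get(url, headers=headers, proxies=next(rotation))` where `rotation = proxy_rotation(PROXIES)`.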

Note: There is no magic formula for this, as Cloudflare is constantly adapting its methods for detecting bot traffic. Using an API, as mentioned by @Speedlulu, would be the best approach.

Example

Added a user agent as one of the headers and also changed findAll() to find_all(), because this is the syntax you should use in new code.

import requests
from bs4 import BeautifulSoup

# Browser-like user agent so the request is less likely to be flagged as a bot
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

def block_chain_explorer_block_id(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    # Collected as in the question; not used further in this example
    tags = soup.find_all('div', attrs={'class': 'col-md-9'})
    print(soup.find_all('a'))

block_chain_explorer_block_id('https://etherscan.io/tx/0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54')
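Since the header trick can stop working at any time, it may help to check whether the response is the Cloudflare error page before parsing it. The heuristic below is an assumption based only on the link that appears in the question's output. The helper name is hypothetical, and Cloudflare may serve other block pages that this check misses.

```python
# Link seen in the blocked response shown in the question.
CLOUDFLARE_MARKER = 'cloudflare.com/5xx-error-landing'


def looks_like_cloudflare_block(html):
    """Return True if the HTML text contains the Cloudflare error-page link."""
    return CLOUDFLARE_MARKER in html
```

For example, call `looks_like_cloudflare_block(r.text)` right after `requests.get()` and skip parsing when it returns True.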