Code to scrape etherscan transaction block id
def block_chain_explorer_block_id(url):
    import requests
    from bs4 import BeautifulSoup
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tags = soup.findAll('div', attrs = {'class':'col-md-9'})
    print(soup.findAll('a'))

block_chain_explorer_block_id('https://etherscan.io/tx/0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54')
Output I am getting:
[<a href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" rel="noopener noreferrer" target="_blank">Cloudflare</a>]
I am getting the output above for etherscan, while the same code works fine for polygonscan.
Any idea how to make it work?
CodePudding user response:
Etherscan has an API (with a free plan).
You should use it instead of trying to scrape the site. Here is the documentation for transactions: https://docs.etherscan.io/api-endpoints/stats
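As a rough sketch of what the API route could look like: the endpoint URL and the module=proxy / action=eth_getTransactionByHash parameters below are assumptions based on Etherscan's JSON-RPC proxy endpoints, so check them against the current documentation. The function names and the YOUR_API_KEY placeholder are made up for illustration.

```python
# Assumed endpoint; newer API versions may use a different URL and extra parameters.
API_URL = "https://api.etherscan.io/api"


def build_params(tx_hash, api_key):
    """Query parameters for the (assumed) eth_getTransactionByHash proxy action."""
    return {
        "module": "proxy",
        "action": "eth_getTransactionByHash",
        "txhash": tx_hash,
        "apikey": api_key,
    }


def parse_block_number(payload):
    """Return the block number as an int, or None if it is not available.

    A successful reply is expected to carry a dict in "result" whose
    "blockNumber" is a hex string; error replies carry a plain string instead.
    """
    result = payload.get("result")
    if not isinstance(result, dict) or result.get("blockNumber") is None:
        return None
    return int(result["blockNumber"], 16)


def fetch_block_number(tx_hash, api_key):
    """Ask the API for a transaction and return the block it was mined in."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(API_URL, params=build_params(tx_hash, api_key), timeout=10)
    response.raise_for_status()
    return parse_block_number(response.json())
```

It could then be called as fetch_block_number('0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54', 'YOUR_API_KEY'), which avoids parsing HTML altogether.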
CodePudding user response:
Adding some headers to the request, so that it looks like it comes from a browser, can provide momentary relief, but it is far from bulletproof.
You should also consider how often and how fast you request each of the target pages.
Using rotating proxies is also a common approach.
Note: there is no magic formula for this, as Cloudflare is constantly adapting its methods for detecting bot traffic. Using an API, as mentioned by @Speedlulu, would be the best approach.
Example
Added a user-agent as one of the headers and also changed findAll() to find_all(), because this is the syntax you should use in new code.
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

def block_chain_explorer_block_id(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    tags = soup.find_all('div', attrs={'class': 'col-md-9'})
    print(soup.find_all('a'))

block_chain_explorer_block_id('https://etherscan.io/tx/0x4529e9f79139edab871a699df455e57101cca90574e435da89db457df4885c54')
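The points about request pacing and rotating proxies could be sketched roughly as follows. This is only an illustration under assumptions: fetch_politely and fetch_with_requests are hypothetical helper names, the two-second delay is arbitrary, and any proxy URLs would have to be your own. Treating every non-200 status as "blocked" is a simplification.

```python
import itertools
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, proxies=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and cycling through proxies.

    `fetch(url, proxy)` must return an object with a `status_code` attribute.
    Returns a dict mapping each URL to its response, or to None if the status
    was not 200 (a 403 or 503 often means the request was blocked).
    """
    proxy_cycle = itertools.cycle(proxies or [None])
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait between consecutive requests
        response = fetch(url, next(proxy_cycle))
        results[url] = response if response.status_code == 200 else None
    return results


def fetch_with_requests(url, proxy):
    """One possible `fetch` callable, built on requests with a browser-like user agent."""
    import requests  # third-party; imported here so fetch_politely works without it

    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
    proxy_map = {'http': proxy, 'https': proxy} if proxy else None
    return requests.get(url, headers=headers, proxies=proxy_map, timeout=10)
```

Calling fetch_politely(list_of_urls, fetch_with_requests) would then space the requests out; a None entry in the result signals that the page was probably not served normally.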