Problem with web scraping - JavaScript on website is disabled

Time:02-13

Hello,

I've been playing with Discord bots (in Python) for a while now, and I've run into a problem scraping information from some websites. They appear to protect themselves from data collection: the page reports that JavaScript is disabled, so I can't get to the data.

I have already looked at many websites that recommend changing the request headers, among other things, but that has not helped.

The next step was to use Selenium, which returns this message:

We're sorry but Hive-Engine Explorer doesn't work properly without JavaScript enabled. Please enable it to continue.

Code:

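  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options

  # Chrome flags commonly used in containers and on headless servers
  chrome_options = Options()
  chrome_options.add_argument('--no-sandbox')
  chrome_options.add_argument("--disable-gpu")
  chrome_options.add_argument('--disable-dev-shm-usage')

  driver = webdriver.Chrome(options=chrome_options)
  driver.get("https://he.dtools.dev/richlist/BEE")
  # page_source is read immediately after the page loads
  htmlSource = driver.page_source
  print(htmlSource)
  driver.quit()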

I also checked how it looks in the browser itself, and as you can see, after opening the page there is no way to view the HTML source.

Image from website

My question is: is it possible to bypass such security measures? Unfortunately, I wanted to download the information from the API, but that is not possible in this case.

CodePudding user response:

You don't need to run Selenium to get this data. The site uses a backend API to deliver the data, which you can replicate easily in Python:

import requests
import pandas as pd
import time
import json

token = 'BEE'   # token symbol to look up
limit = 100     # number of rows to request

# Request id for the JSON-RPC call; the current Unix time is used here
id_ = int(time.time())

headers = {
    'accept': 'application/json, text/plain, */*',
    'content-type': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.3'
}
url = 'https://api.hive-engine.com/rpc/contracts'

# JSON-RPC "find" call on the balances table of the tokens contract
payload = {
    "jsonrpc": "2.0",
    "id": id_,
    "method": "find",
    "params": {
        "contract": "tokens",
        "table": "balances",
        "query": {"symbol": token},
        "offset": 0,
        "limit": limit,
    },
}
resp = requests.post(url, headers=headers, data=json.dumps(payload)).json()

# The rows are returned under the 'result' key
df = pd.DataFrame(resp['result'])
df.to_csv('HiveData.csv', index=False)

print('Saved to HiveData.csv')
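
The request above only fetches the first `limit` rows. If you need more than that, one option is to step the `offset` field of the payload until a short page comes back. The following is only a sketch: `build_payload`, `fetch_all_balances`, `post_json` and `max_pages` are hypothetical names of mine, and I am assuming the API returns fewer than `limit` rows once the data runs out, which I have not verified.

```python
import time
from typing import Any, Callable, Dict, List


def build_payload(symbol: str, offset: int, limit: int) -> Dict[str, Any]:
    """Build the same JSON-RPC 'find' payload as above for one page."""
    return {
        "jsonrpc": "2.0",
        "id": int(time.time()),
        "method": "find",
        "params": {
            "contract": "tokens",
            "table": "balances",
            "query": {"symbol": symbol},
            "offset": offset,
            "limit": limit,
        },
    }


def fetch_all_balances(
    post_json: Callable[[Dict[str, Any]], Dict[str, Any]],
    symbol: str,
    limit: int = 100,
    max_pages: int = 50,
) -> List[Dict[str, Any]]:
    """Collect rows page by page.

    post_json is a caller-supplied function that sends one payload and
    returns the decoded JSON response. Stops on a short page (assumed to
    mean "no more data") or after max_pages requests.
    """
    rows: List[Dict[str, Any]] = []
    for page in range(max_pages):
        payload = build_payload(symbol, page * limit, limit)
        batch = post_json(payload).get("result") or []
        rows.extend(batch)
        if len(batch) < limit:
            break
    return rows
```

With the `url` and `headers` from the code above, you could call it as `fetch_all_balances(lambda p: requests.post(url, headers=headers, json=p, timeout=30).json(), 'BEE')` and pass the result to `pd.DataFrame` as before. Passing the sender in as a function keeps the paging logic testable without hitting the network.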