I'm trying to automate data extraction from the ASX website (https://www.asxenergy.com.au/futures_nz) into my database by writing a web scraping Python script and deploying it in Azure Databricks. Currently, the script works in Visual Studio Code, but when I try to run it in Databricks, it crashes with the error below.
Could not get version for google-chrome with the command: google-chrome --version || google-chrome-stable --version || google-chrome-beta --version || google-chrome-dev --version
I believe I will need to simplify my code so that it obtains the table without relying on a web browser.
My sample code is below:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
import sys
from datetime import datetime
from webdriver_manager.chrome import ChromeDriverManager
options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Pass the options so that headless mode is actually applied
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
#browser = webdriver.Chrome('C:/chromedriver',options=options) # Optional argument, if not specified will search path.
browser.get('https://www.asxenergy.com.au/futures_nz')
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html,'html.parser')
market_dataset = soup.find_all(attrs={'class':'market-dataset'})
market_dataset
I tried the code below instead, using only the requests package, but it failed because it couldn't find any div with the class 'market-dataset'.
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
import sys
from datetime import datetime
from webdriver_manager.chrome import ChromeDriverManager
URL = "https://www.asxenergy.com.au/futures_nz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
market_dataset = soup.findAll("div",href=True,attrs={'class':'market-dataset'})
Can anyone please help?
CodePudding user response:
This page uses JavaScript to load the table from https://www.asxenergy.com.au/futures_nz/dataset
The server checks whether the request is an AJAX/XHR request, so it needs the header
'X-Requested-With': 'XMLHttpRequest'
Also, your findAll("div", href=True, ...) tries to find <div href="...">, but this page doesn't have any such element, so I search for a normal <div> with class 'market-dataset' instead.
Minimal working code:
import requests
from bs4 import BeautifulSoup
headers = {
# 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
'X-Requested-With': 'XMLHttpRequest'
}
URL = "https://www.asxenergy.com.au/futures_nz/dataset"
response = requests.get(URL, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
market_dataset = soup.findAll("div", attrs={'class':'market-dataset'})
print('len(market_dataset):', len(market_dataset))
Result:
len(market_dataset): 10
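Since the goal is to get the table into a database, a possible next step is to turn each 'market-dataset' div into plain rows. The following is only a sketch: it assumes each div contains ordinary <table>/<tr>/<td> markup, which I have not verified against the real page. The function name extract_tables and the SAMPLE_HTML string are hypothetical and only there to make the example runnable offline.

```python
from bs4 import BeautifulSoup


def extract_tables(html):
    """Return one list of rows per <table> inside a 'market-dataset' div.

    Each row is a list with the stripped text of its <th>/<td> cells.
    Assumes plain table markup inside the div (an assumption, not verified).
    """
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for div in soup.find_all("div", attrs={"class": "market-dataset"}):
        for table in div.find_all("table"):
            rows = []
            for tr in table.find_all("tr"):
                cells = [cell.get_text(strip=True)
                         for cell in tr.find_all(["th", "td"])]
                if cells:  # skip rows that contain no cells
                    rows.append(cells)
            tables.append(rows)
    return tables


# Hypothetical sample markup, NOT copied from the real page.
SAMPLE_HTML = """
<div class="market-dataset">
  <table>
    <tr><th>Contract</th><th>Bid</th><th>Ask</th></tr>
    <tr><td>Q1 2023</td><td>100.00</td><td>101.50</td></tr>
  </table>
</div>
<div class="other"><table><tr><td>ignored</td></tr></table></div>
"""

if __name__ == "__main__":
    for rows in extract_tables(SAMPLE_HTML):
        for row in rows:
            print(row)
```

With the minimal working code above, you would call extract_tables(response.text) instead of passing SAMPLE_HTML. If the real markup matches the assumption, each list of rows can then be loaded into a pandas DataFrame, using the first row as the header, before writing it to the database.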