def scrape_asx():
driver= get_driver()
#driver.get('http://www.asx.com.au/asx/statistics/prevBusDayAnns.do')
driver.execute_script("window.open('http://www.asx.com.au/asx/statistics/prevBusDayAnns.do', '_self');")
driver.maximize_window()
time.sleep(5)
driver.implicitly_wait(10)
driver.find_element_by_css_selector('#onetrust-accept-btn-handler').click()
time.sleep(10)
driver.implicitly_wait(5)
text = driver.execute_script("return document.getElementsByTagName('tr')[1].cells[3].firstElementChild.firstChild;")
return text
Hi Guys, please help me with this issue, I have been stuck here for a day.
I am trying to extract the title from the contents here.
And is the console of the website.
Please help!!Thanks in advance.
CodePudding user response:
The information you're looking for is in an iframe, so - using selenium - you would need to switch to iframe, and then locate it. However, selenium is not a web scraping tool, but a testing tool: it should be the last call when trying to extract information from a website. The optimal way forward here is to scrape the source of that iframe. If you are only after the titles in that list, you can do the following:
url = 'https://www.asx.com.au/asx/v2/statistics/todayAnns.do'
df = pd.read_html(url)[0]
print(df)
The result printed in terminal:
ASX Code Date Price sens. Headline
0 MOH 03/10/2022 8:16 PM NaN Notice Under Section 708A 1 page 276.3KB
1 MOH 03/10/2022 8:14 PM NaN Application for quotation of securities - MOH 8 pages 29.6KB
2 92E 03/10/2022 8:03 PM NaN Letter to Shareholders 1 page 137.2KB
3 92E 03/10/2022 8:02 PM NaN Notice of Annual General Meeting/Proxy Form 37 pages 673.7KB
4 NST 03/10/2022 8:01 PM NaN Notice of 2022 Annual General Meeting 38 pages 1.4MB
... ... ... ... ...
695 VHT 03/10/2022 7:33 AM NaN Volpara CEO appointed to the Board as Managing Director 2 pages 148.1KB
696 A2M 03/10/2022 7:33 AM NaN Renewal of arrangements with China State Farm 2 pages 254.1KB
697 PVL 03/10/2022 7:32 AM NaN Appendix 4G and Corporate Governance Statement 20 pages 446.3KB
698 NTL 03/10/2022 7:32 AM NaN Cleansing Notice 1 page 153.6KB
699 NTL 03/10/2022 7:32 AM NaN Proposed issue of securities - NTL 5 pages 27.5KB
700 rows × 4 columns
If you need more info from that page, you can extract it with Requests & BeautifulSoup, without the overheads of Selenium.
Relevant pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html