I am very new to web scraping. I am working with Selenium and want to extract text from span tags. The tags do not have any classes or ids, and they are inside li tags. I need to extract the text from a span tag, then click on the next link and extract the text from the span element inside that li tag. I don't know how to do that. Could you please help me?
This is the HTML I am working with:
<div >
  <div>
    <ul >
      <li >
        <ul >
          <li>
            <!-- Default clicked -->
            <span>VOI By Exchange</span>
          </li>
          <li>
            <a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html" target="_self">
            <span>Agricultural</span></a>
          </li>
          <li>
            <a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" target="_self">
            <span>Energy</span></a>
          </li>
        </ul>
      </li>
    </ul>
  </div>
</div>
CodePudding user response:
EDIT
Since you need to use Selenium, you can use XPath expressions to locate elements that have no class or id to refer to. In your browser, press F12 to open the developer tools, then right-click the element you are interested in and choose "Copy -> XPath". Here is the proposed solution (I assume you have Chrome installed and chromedriver in the same folder as the .py file):
import os

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

url = "https://www.cmegroup.com/market-data/volume-open-interest/metals-volume.html"
i = 1  # XPath positions are 1-based

options = webdriver.ChromeOptions()
# Uncomment the next line to run headless (no browser window is opened)
# options.add_argument("--headless")
# Note: executable_path is the older style; newer Selenium 4 releases
# expect service=Service(path) instead
driver = webdriver.Chrome(
    options=options, executable_path=os.getcwd() + "/chromedriver.exe"
)
driver.get(url)

while True:
    # Absolute XPath of the span in the i-th <li>, as copied from the dev tools
    xpath = f"/html/body/div[1]/div[2]/div/div[2]/div[2]/div/ul/li/ul/li[{i}]/a/span"
    try:
        res = driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        # No matching span at this position: stop
        break
    print(res.text)
    i += 1  # move on to the next <li>

driver.quit()
Results:
VOI By Exchange
Agricultural
Energy
Equities
FX
Interest Rates
You can extend this snippet to handle the .csv download from each page.
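The absolute XPath above depends on the whole page layout, so a shorter relative expression may be easier to maintain. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` (not Selenium) on a trimmed copy of the question's markup to show what `li/a/span` and `li/span` each match. In Selenium the corresponding expression would be something like `//li/a/span` passed to `driver.find_elements(By.XPATH, ...)`. Whether that is specific enough on the live page is an assumption you would need to check.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup (well-formed, so ElementTree can parse it).
snippet = """
<ul>
  <li><!-- Default clicked --><span>VOI By Exchange</span></li>
  <li><a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html" target="_self"><span>Agricultural</span></a></li>
  <li><a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" target="_self"><span>Energy</span></a></li>
</ul>
"""
root = ET.fromstring(snippet)

# Spans wrapped in a link: the entries that can be clicked.
linked = [span.text for span in root.findall(".//li/a/span")]

# Spans sitting directly inside an <li>: the currently selected entry.
selected = [span.text for span in root.findall(".//li/span")]

print(linked)    # ['Agricultural', 'Energy']
print(selected)  # ['VOI By Exchange']
```

Note that the selected entry has no `<a>` around its span, so an expression ending in `/a/span` skips it. On the live page the selected tab is presumably marked up the same way, which is likely where the loop above stops.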
OLD
If you are working with a static HTML page (like the one you provided in the question), I suggest using BeautifulSoup. Selenium is better suited when you have to click, fill in forms, or otherwise interact with a web page. Here's a snippet with my solution:
from bs4 import BeautifulSoup
html_doc = """
<div >
<div>
<ul >
<li >
<ul >
<li>
<!-- Default clicked -->
<span>VOI By Exchange</span>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html"
target="_self">
<span>Agricultural</span></a>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html"
target="_self">
<span>Energy</span></a>
</li>
</ul>
</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for span in soup.find_all("span"):
    print(span.text)
And the result will be:
VOI By Exchange
Agricultural
Energy
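Since the question also asks about following each link, it may help to collect every link's URL together with its span text. This is a minimal sketch under a few assumptions: it uses a shortened copy of the question's HTML, the name `links` is just illustrative, and it only gathers the pairs without fetching the linked pages.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html_doc = """
<ul>
  <li><!-- Default clicked --><span>VOI By Exchange</span></li>
  <li><a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html" target="_self">
    <span>Agricultural</span></a></li>
  <li><a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" target="_self">
    <span>Energy</span></a></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Pair each link's label (the text of its <span>) with its target URL.
# The currently selected entry has no <a>, so it is skipped here.
links = [
    (a.span.get_text(strip=True), a["href"])
    for a in soup.select("li > a[href]")
    if a.span is not None
]

for label, href in links:
    print(label, href)
```

Each `href` could then be opened in turn, for example with `driver.get(href)` in Selenium, and the same span extraction repeated on the new page.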