Selenium: Extracting the elements having no class


I am very new to web scraping. I am working with Selenium and want to extract the text from some span tags. The tags do not have any classes or IDs, and each span sits inside an li tag. I need to extract the text from one span, then click the next link and extract the text from the span inside that li. I don't know how to do that. Could you please help me with it?

This is the HTML I am working with:

<div>
    <div>
        <ul>

            <li>
                <ul>

                    <li>
                        <!-- Default clicked -->
                        <span>VOI By Exchange</span>
                    </li>

                    <li>

                        <a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html" target="_self">

                        <span>Agricultural</span></a>

                    </li>

                    <li>

                        <a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" target="_self">

                        <span>Energy</span></a>
                    </li>
                </ul>
            </li>
        </ul>
    </div>
</div>

CodePudding user response:

EDIT

Since you need to use Selenium, you can use XPaths to locate elements when there is no class or ID to refer to. In your browser, press F12, then right-click the element you are interested in and choose "Copy -> Copy XPath". This is the proposed solution (I assume you have Chrome and the chromedriver in the same folder as the .py file):

import os
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver

url = "https://www.cmegroup.com/market-data/volume-open-interest/metals-volume.html"

i = 1
options = webdriver.ChromeOptions()
# uncomment the next line to run headless (no browser window will open)
# options.add_argument("--headless")
driver = webdriver.Chrome(
    options=options, executable_path=os.getcwd() + "/chromedriver.exe"
)

driver.get(url)
while True:
    # step through the li elements one by one until there are no more matches
    xpath = f"/html/body/div[1]/div[2]/div/div[2]/div[2]/div/ul/li/ul/li[{i}]/a/span"
    try:
        res = driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        # There are no more span elements in the li
        break
    print(res.text)
    i += 1

Results:

VOI By Exchange
Agricultural
Energy
Equities
FX
Interest Rates

You can extend this snippet to handle the .csv download from each page.
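For example, here is a rough sketch of that extension, continuing from the snippet above: it first collects the href of every link, then visits each page in turn. The XPath of the CSV download button is a placeholder you would have to copy yourself from the dev tools of each page, as described earlier.

# Sketch: collect the href of each link, then visit the pages one by one.
links = []
i = 1
while True:
    xpath = f"/html/body/div[1]/div[2]/div/div[2]/div[2]/div/ul/li/ul/li[{i}]/a"
    try:
        link = driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        break
    links.append(link.get_attribute("href"))
    i += 1

for href in links:
    driver.get(href)
    # Placeholder: replace with the XPath of the page's CSV download button,
    # copied via "Copy -> Copy XPath" in the dev tools.
    # driver.find_element(By.XPATH, "<download-button-xpath>").click()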

OLD

If you are working with a static HTML page (like the one you provided in the question), I suggest you use BeautifulSoup. Selenium is better suited when you have to click, fill in forms, or otherwise interact with a web page. Here is a snippet with my solution:

from bs4 import BeautifulSoup

html_doc = """
    <div>
        <div>
            <ul>

                <li>
                    <ul>

                        <li>
                            <!-- Default clicked -->
                            <span>VOI By Exchange</span>
                        </li>

                        <li>

                            <a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html"
                                 target="_self">

                                <span>Agricultural</span></a>

                        </li>

                        <li>

                            <a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html"
                                target="_self">

                                <span>Energy</span></a>
                        </li>
                    </ul>
                </li>
            </ul>
        </div>
    </div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for span in soup.find_all("span"):
    print(span.text)

And the result will be:

VOI By Exchange
Agricultural
Energy
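
If you also need the link targets (for example to fetch each linked page afterwards), you can select the anchors instead of the spans. A minimal sketch, reusing the html_doc defined above; the requests call is just one possible way to fetch the pages:

import requests  # optional, only needed if you want to fetch the linked pages
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

# Collect (label, url) pairs from the anchors inside the list items
for a in soup.select("li a[href]"):
    label = a.get_text(strip=True)
    url = a["href"]
    print(label, url)
    # page = requests.get(url)  # fetch the linked page if you need its contents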