Selenium Python: Extracting text from elements having no class

Time:09-02

I am very new to web scraping. I am working with Selenium and want to extract the text from span tags that have no classes or ids. The span tags are inside li tags, and I need to extract the text from each span inside those li tags. I don't know how to do that. Could you please help me?

HTML of the elements:

<div >
    <div>
        <ul >

            <li >
                <ul > 

                    <li>
                        <!-- Default clicked -->
                        <span>VOI By Exchange</span>
                    </li>

                    <li>
                                    
                        <a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html"  target="_self">

                        <span>Agricultural</span></a>

                    </li>
                        
                    <li>

                        <a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html"  target="_self">

                        <span>Energy</span></a>
                    </li>
                </ul>
            </li>
        </ul>
    </div>
</div>

CodePudding user response:

The simplest way to do this is:

for e in driver.find_elements(By.CSS_SELECTOR, "ul.cmeHorizontalList a"):
    print(e.text)

Some pitfalls in other answers...

  1. You shouldn't use exceptions to control flow. It's bad practice and slower than checking directly.

  2. You shouldn't use Copy > XPath from a browser. Most of the time this generates XPaths that are very brittle. Any XPath that starts at the HTML tag, has more than a few levels, or uses a number of indices (e.g. div[2] and the like) is going to be very brittle. Even a minor change to the page will break that locator.

  3. Prefer CSS selectors over XPath. CSS selectors are better supported, faster, and the syntax is simpler.
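As a rough illustration of point 3, here are the two locator styles for the same spans side by side (selector strings only; the `cmeHorizontalList cmeListSeparator` classes are taken from the other answers in this thread):

```python
# Equivalent locators for the spans in the question's markup.
# The CSS forms are shorter and survive layout changes that break
# index-based XPaths such as /html/body/div[1]/.../li[2]/a/span.
locators = [
    # (XPath, CSS selector)
    ("//ul[@class='cmeHorizontalList cmeListSeparator']//li//span",
     "ul.cmeHorizontalList.cmeListSeparator li span"),
    ("//a[@target='_self']/span",
     "a[target='_self'] > span"),
]

for xpath, css in locators:
    print(f"XPath: {xpath}\nCSS:   {css}\n")
```

Either string can be passed to `driver.find_elements` with `By.XPATH` or `By.CSS_SELECTOR` respectively.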

CodePudding user response:

EDIT

Since you need to use Selenium, you can use XPaths to locate elements when there is no class or id to refer to. In your browser, press F12, then right-click the element of interest and choose "Copy > XPath". This is the proposed solution (I assume you are using Chrome and that chromedriver is in the same folder as the .py file):

import os
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver

url = "https://www.cmegroup.com/market-data/volume-open-interest/metals-volume.html"

i = 1
options = webdriver.ChromeOptions()
# uncomment the next line to run headless (no browser window)
# options.add_argument("--headless")
driver = webdriver.Chrome(
            options=options, executable_path=os.getcwd() + "/chromedriver.exe"
        )
        
driver.get(url)
while True:
    xpath = f"/html/body/div[1]/div[2]/div/div[2]/div[2]/div/ul/li/ul/li[{i}]/a/span"
    try:
        res = driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        # There are no more span elements in li
        break 
    print(res.text)
    i += 1

Results:

VOI By Exchange
Agricultural
Energy
Equities
FX
Interest Rates

You can extend this snippet to handle the .csv download from each page.
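The loop above can also be factored into a small helper so the "increment until the lookup fails" pattern is testable without a browser. This is a hedged sketch with an injected lookup function; `texts_until_missing` and `fake_find` are hypothetical names, not Selenium API:

```python
def texts_until_missing(find, start=1):
    """Yield find(i) for i = start, start+1, ... until find raises LookupError."""
    i = start
    while True:
        try:
            yield find(i)
        except LookupError:
            return
        i += 1

# Stand-in for driver.find_element with an indexed XPath:
# a fixed list plays the role of the live page.
menu = ["VOI By Exchange", "Agricultural", "Energy"]

def fake_find(i):
    return menu[i - 1]  # raises IndexError (a LookupError) past the end

print(list(texts_until_missing(fake_find)))
# ['VOI By Exchange', 'Agricultural', 'Energy']
```

With Selenium, `find` would wrap `driver.find_element(By.XPATH, ...)` and translate `NoSuchElementException` into `LookupError`.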

OLD

If you are working with a static HTML page (like the one you provided in the question) I suggest using BeautifulSoup. Selenium is better suited when you have to click, fill forms, or otherwise interact with a web page. Here's a snippet with my solution:

from bs4 import BeautifulSoup

html_doc = """
    <div >
        <div>
            <ul >

                <li >
                    <ul >

                        <li>
                            <!-- Default clicked -->
                            <span>VOI By Exchange</span>
                        </li>

                        <li>

                            <a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html"
                                 target="_self">

                                <span>Agricultural</span></a>

                        </li>

                        <li>

                            <a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" 
                                target="_self">

                                <span>Energy</span></a>
                        </li>
                    </ul>
                </li>
            </ul>
        </div>
    </div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for span in soup.find_all("span"):
    print(span.text)

And the result will be:

VOI By Exchange
Agricultural
Energy

CodePudding user response:

To extract the desired texts, e.g. VOI By Exchange, Agricultural, Energy, etc., you need to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.cmeHorizontalList.cmeListSeparator li span")))])
    
  • Using XPATH:

    driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='onetrust-accept-btn-handler']"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='cmeHorizontalList cmeListSeparator']//li//span")))])
    
  • Console Output:

    ['VOI By Exchange', 'Agricultural', 'Energy', 'Equities', 'FX', 'Interest Rates', 'Metals']
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    