Get all text data from multiple classes with the same name with selenium in python

Time:09-28

I am trying to build a web scraper in Python using Selenium. I would like to get the text of the link nested inside each h3 tag, as well as the href of the separate "a" tag that follows the h3. The basic structure of the website is below.

<div class="class_name">
    <h3>
        <a href="link that I do NOT want">Text That I want</a>
    </h3>
    <a href="Link that I want"></a>
</div>
<div class="class_name">
    <h3>
        <a href="link that I do NOT want">More text that I want</a>
    </h3>
    <a href="another link that I want"></a>
</div>

How would I go about doing this? I've looked at XPath solutions as well as using

driver.find_elements(By.CLASS_NAME, "class_name")

but I can't seem to get anything to work. I was thinking of getting each class location and iterating through each of them separately, but I have no clue how to do that. Any help is appreciated!

CodePudding user response:

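I managed to extract the second link in each div through .contents. d.a['href'] returns the link inside <h3> because d.a is shorthand for d.find('a'), which returns the first <a> descendant at any depth, and that is the one nested in <h3>.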

from bs4 import BeautifulSoup

data = """

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>Web page example</title>
    </head>
    <body>
        <h1>Caption</h1>
        <!-- Comment -->
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div class="class_name">
            <h3>
                <a href="link that I do NOT want">Text That I want</a>
            </h3>
            <a href="Link that I want"></a>
        </div>
        <div class="class_name">
            <h3>
                <a href="link that I do NOT want">More text that I want</a>
            </h3>
            <a href="another link that I want"></a>
        </div>
    </body>
</html>
"""


soup = BeautifulSoup(data, features="lxml")
divs = soup.find_all('div', class_="class_name")

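for d in divs:
    # The link nested in the heading carries the text we want.
    print(f"text = {d.h3.a.text}")
    # d.a is the first <a> descendant, i.e. the one inside <h3> (unwanted href).
    print(f"h3 href = {d.a['href']}")
    # The wanted <a> is a direct child of the div; index 3 skips the
    # whitespace text nodes around <h3>, so it depends on this exact markup.
    print(f"href = {d.contents[3]['href']}")

Output:

text = Text That I want
h3 href = link that I do NOT want
href = Link that I want
text = More text that I want
h3 href = link that I do NOT want
href = another link that I want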