How can I parse or scrape the email from this HTML using Selenium or Beautiful Soup

How can I parse the second a tag in this div section? When I try, it always selects the first one among the div's children. How can I select the second one so that I can get the email?

<div >
  Address:
  <div style="padding-left: 1em">
    Box 460
    <br />
    <a href="/canada/Clinton-Village.html"
      >100 Mile House, British Columbia V0K 2E0</a
    >
  </div>
  <br /><b>Enrollment:</b> 310<br />
  <b>Grade span:</b> K-7<br />
  <br /><b>School Type:</b> Standard School<br />
  <b>School Category:</b> Public School<br />
  <br /><b>Principal:</b> Mrs Donna Rodger<br />
  <b>Phone (verify before using):</b> (250) 395-2258<br />
  <b>Fax (verify before using):</b> (250) 395-3621<br />
  <b>E-mail:</b>

  <a href="mailto:[email protected]">[email protected]</a>
  <br />
</div>

I tried using XPath:

        emailElement = email_driver.find_element(By.XPATH, '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]')
        result_email = emailElement.find_element(By.TAG_NAME, "a")
        print(result_email.text)

Output

100 Mile House, British Columbia V0K 2E0

It keeps giving me the first tag, and I want to select the second tag.

Expected output

[email protected]

I want to parse this section

<a href="mailto:[email protected]">[email protected]</a>

CodePudding user response：

Try a CSS selector or an XPath instead of the tag name.

CSS selector: email_driver.find_element(By.CSS_SELECTOR, "a[href*='mailto:']")
or
XPath: email_driver.find_element(By.XPATH, "//div[@class='col-md-4']/a[contains(@href,'mailto')]")

CodePudding user response：

Instead of

emailElement = email_driver.find_element(By.XPATH, '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]')
result_email = emailElement.find_element(By.TAG_NAME, "a")
print(result_email.text)

Try this:

emailElement = email_driver.find_element(By.XPATH, '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]')
result_email = emailElement.find_element(By.XPATH, ".//a[contains(@href,'mailto')]")
print(result_email.text)

You should also improve the '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]' XPath expression, since a long positional path like this breaks easily, but I can't help with that because you didn't share the rest of the page.
You may also need to use WebDriverWait with expected conditions to wait until the element is present or visible.

CodePudding user response：

There are several ways to identify the element.

Option 1: Find the tag that contains the text E-mail:, then take its next sibling anchor tag

print(email_driver.find_element(By.XPATH, "//b[text()='E-mail:']/following-sibling::a[1]").text)

Option 2: Find the tag that contains the text E-mail:, then take the next anchor tag that follows it in the document

print(email_driver.find_element(By.XPATH, "//b[text()='E-mail:']/following::a[1]").text)

Option 3: Find the anchor tag whose href starts with mailto, using starts-with()

print(email_driver.find_element(By.XPATH, "//a[starts-with(@href,'mailto')]").text)

Option 4: Find the anchor tag whose href starts with mailto, using ^= in a CSS selector

print(email_driver.find_element(By.CSS_SELECTOR, "a[href^='mailto']").text)
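
The question also asks about parsing the HTML without a browser. As a minimal sketch of the same "href starts with mailto" idea, the snippet below uses Python's standard html.parser module rather than Beautiful Soup, so it runs without extra packages. MailtoCollector is a hypothetical helper, and the markup is a trimmed copy of the question's HTML with a placeholder address.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Same test as a[href^='mailto']: the href must start with "mailto:".
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." query part.
            self.emails.append(href[len("mailto:"):].split("?", 1)[0])


# Trimmed copy of the markup from the question; the address is a placeholder.
SAMPLE_HTML = """
<div>
  <div style="padding-left: 1em">
    Box 460<br />
    <a href="/canada/Clinton-Village.html">100 Mile House, British Columbia V0K 2E0</a>
  </div>
  <b>E-mail:</b>
  <a href="mailto:office@example.org">office@example.org</a>
</div>
"""

collector = MailtoCollector()
collector.feed(SAMPLE_HTML)
print(collector.emails)
```

Reading the address from the href instead of the link text means the first anchor (the address link) is skipped no matter where it sits in the div.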