How can I parse or scrape the email from this HTML using selenium or beautiful soup

Time:09-21

How can I parse the second a tag from this div section? When I try, it always selects the first one among the div's children. How can I select the second one so I can get the email?

<div >
  Address:
  <div style="padding-left: 1em">
    Box 460
    <br />
    <a href="/canada/Clinton-Village.html"
      >100 Mile House, British Columbia V0K 2E0</a
    >
  </div>
  <br /><b>Enrollment:</b> 310<br />
  <b>Grade span:</b> K-7<br />
  <br /><b>School Type:</b> Standard School<br />
  <b>School Category:</b> Public School<br />
  <br /><b>Principal:</b> Mrs Donna Rodger<br />
  <b>Phone (verify before using):</b> (250) 395-2258<br />
  <b>Fax (verify before using):</b> (250) 395-3621<br />
  <b>E-mail:</b>

  <a href="mailto:[email protected]">[email protected]</a>
  <br />
</div>

I tried using XPath:

        emailElement = email_driver.find_element(By.XPATH, '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]')
        result_email = emailElement.find_element(By.TAG_NAME, "a")
        print(result_email.text)

Output

100 Mile House, British Columbia V0K 2E0

It keeps giving me the first tag, and I want to select the second one.

Expected output

[email protected]

I want to parse this section

<a href="mailto:[email protected]">[email protected]</a>

CodePudding user response:

Try a CSS selector or an XPath instead of the tag name. In the Python bindings:

CSS selector: email_driver.find_element(By.CSS_SELECTOR, "a[href*='mailto:']")
or
XPath: email_driver.find_element(By.XPATH, "//div[@class='col-md-4']/a[contains(@href,'mailto')]")

CodePudding user response:

Instead of

emailElement = email_driver.find_element(By.XPATH, '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]')
result_email = emailElement.find_element(By.TAG_NAME, "a")
print(result_email.text)

Try this:

emailElement = email_driver.find_element(By.XPATH, '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]')
result_email = emailElement.find_element(By.XPATH, ".//a[contains(@href,'mailto')]")
print(result_email.text)

You should also improve the '//*[@id="main_body"]/div[3]/div[1]/div[1]/div[1]/div[1]' XPath expression, since a long positional path like this breaks easily when the page layout changes, but I can't help there because you didn't share the surrounding page markup.
You may also need to use WebDriverWait with expected conditions to wait for the element to be present or visible.

CodePudding user response:

There are many ways you can identify the element.

Option 1: Find the tag that contains the E-mail text, then take its next sibling anchor tag

print(email_driver.find_element(By.XPATH, "//b[text()='E-mail:']/following-sibling::a[1]").text)

Option 2: Find the tag that contains the E-mail text, then take the next anchor tag in document order

print(email_driver.find_element(By.XPATH, "//b[text()='E-mail:']/following::a[1]").text)

Option 3: Find the anchor tag whose href starts with mailto, using starts-with()

print(email_driver.find_element(By.XPATH, "//a[starts-with(@href,'mailto')]").text)

Option 4: Find the anchor tag whose href starts with mailto, using the ^= operator in a CSS selector

print(email_driver.find_element(By.CSS_SELECTOR, "a[href^='mailto']").text)
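
The href-prefix check behind Options 3 and 4 can also be tried offline, without a browser. Below is a minimal sketch using only Python's standard-library html.parser. MailtoCollector and SAMPLE_HTML are illustrative names, the markup is a trimmed copy of the snippet from the question, and the address is a placeholder because the real one is obscured there.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect the address part of every <a href="mailto:..."> link."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        if href.startswith("mailto:"):
            # Drop the scheme and any "?subject=..." query part.
            self.emails.append(href[len("mailto:"):].split("?", 1)[0])


# Trimmed copy of the question's markup; the address is a placeholder.
SAMPLE_HTML = """
<div>
  <div style="padding-left: 1em">
    <a href="/canada/Clinton-Village.html">100 Mile House, British Columbia V0K 2E0</a>
  </div>
  <b>E-mail:</b>
  <a href="mailto:principal@example.com">principal@example.com</a>
</div>
"""

collector = MailtoCollector()
collector.feed(SAMPLE_HTML)
print(collector.emails)  # ['principal@example.com']
```

The first anchor is skipped because its href does not start with mailto:, which is the same reason the attribute-based locators above pick the second link rather than the first.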