Can't locate and capture a few fields out of some unstructured HTML


I'm trying to scoop out four fields from a webpage using the BeautifulSoup library. It's hard to identify the fields individually, which is why I'm asking for help.

Sometimes both emails are present, but that is not always the case. I used indexing to capture the emails for this example, but surely that is the worst way to go about it. Moreover, with the following attempt I can only parse the caption of the email ("Email:"), not the email address itself.

I've tried with (minimum working example):

from bs4 import BeautifulSoup

html = """
  <p>
   <strong>
    Robert Romanoff
   </strong>
   <br/>
   146 West 29th Street, Suite 11W
   <br/>
   New York, New York 10001
   <br/>
   Telephone: (718) 527-1577
   <br/>
   Fax: (718) 276-8501
   <br/>
   Email:
   <a href="mailto:[email protected]">
    [email protected]
   </a>
   <br/>
   Additional Contact: William Locantro
   <br/>
   Email:
   <a href="mailto:[email protected]">
    [email protected]
   </a>
  </p>
"""
soup = BeautifulSoup(html, "lxml")
container = soup.select_one("p")
contact_name = container.strong.text.strip()
# these grab the "Email:" caption text nodes, not the addresses inside the <a> tags
contact_email = [i for i in container.strings if "Email" in i][0].strip()
additional_contact = [i.strip() for i in container.strings if "Additional Contact" in i.strip()][0].strip('Additional Contact:')
additional_email = [i for i in container.strings if "Email" in i][1].strip()
print(contact_name, contact_email, additional_contact, additional_email)

Current output:

Robert Romanoff Email: William Locantro Email:

Expected output:

Robert Romanoff [email protected] William Locantro [email protected]

CodePudding user response:

Here is a solution you can try:

import re
from bs4 import BeautifulSoup

# "html" is the markup string from the question
soup = BeautifulSoup(html, "lxml")

names_ = [
    soup.select_one("p > strong").text.strip(),
    soup.find(text=re.compile("Additional Contact:")).replace('Additional Contact:', '').strip()
]

email_ = [i.strip() for i in soup.find_all(text=re.compile("absol"))]

print(" ".join(i   " "   j for i, j in zip(names_, email_)))

Robert Romanoff [email protected] William Locantro [email protected]
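
Not part of the original answer, but here is a sketch of a variant that keys off the mailto: links instead of a text regex. It assumes the html string from the question and that every address is wrapped in an <a href="mailto:..."> tag, as in the sample markup.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# first name from the <strong> tag, second from the "Additional Contact:" text node
names = [soup.select_one("p > strong").get_text(strip=True)]
extra = soup.find(string=lambda s: s and "Additional Contact:" in s)
names.append(extra.replace("Additional Contact:", "").strip())

# every address sits inside an <a href="mailto:..."> element
emails = [a.get_text(strip=True) for a in soup.select('a[href^="mailto:"]')]

print(" ".join(n + " " + e for n, e in zip(names, emails)))
# Robert Romanoff [email protected] William Locantro [email protected]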

CodePudding user response:

For more complex HTML/XML parsing you should take a look at XPath, which allows very powerful selector rules.

In Python it's available via the parsel package.

from parsel import Selector

html = '...'  # the markup from the question
sel = Selector(html)
name = sel.xpath('//strong[1]/text()').get().strip()
email = sel.xpath("//text()[re:test(., 'Email')]/following-sibling::a/text()").get().strip()
name_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]").re("Additional Contact: (.+)")[0]
email_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]/following-sibling::a/text()").get().strip()
print(name, email, name_additional, email_additional)
# Robert Romanoff [email protected] William Locantro [email protected]
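
Note that parsel is a thin wrapper around lxml, so the re:test() calls above rely on lxml's EXSLT regular-expression support; install the package first with pip install parsel if you don't already have it.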