I'm trying to scoop out four fields from a webpage using BeautifulSoup library. It's hard to identify the fields individually and that is the reason I seek help.
Sometimes both emails are present but that is not always the case. I used indexing to capture the email for this example but surely this is the worst idea to go with. Moreover, with the following attempt I can only parse the caption of the email, not the email address.
I've tried with (minimum working example):
from bs4 import BeautifulSoup
html = """
<p>
<strong>
Robert Romanoff
</strong>
<br/>
146 West 29th Street, Suite 11W
<br/>
New York, New York 10001
<br/>
Telephone: (718) 527-1577
<br/>
Fax: (718) 276-8501
<br/>
Email:
<a href="mailto:[email protected]">
[email protected]
</a>
<br/>
Additional Contact: William Locantro
<br/>
Email:
<a href="mailto:[email protected]">
[email protected]
</a>
</p>
"""
soup = BeautifulSoup(html,"lxml")
container = soup.select_one("p")
contact_name = container.strong.text.strip()
contact_email = [i for i in container.strings if "Email" in i][0].strip()
additional_contact = [i.strip() for i in container.strings if "Additional Contact" in i.strip()][0].strip('Additional Contact:')
additional_email = [i for i in container.strings if "Email" in i][1].strip()
print(contact_name,contact_email,additional_contact,additional_email)
Current output:
Robert Romanoff Email: William Locantro Email:
Expected output:
Robert Romanoff [email protected] William Locantro [email protected]
CodePudding user response:
Here is a solution you can give it a try,
import re
soup = BeautifulSoup(html, "lxml")
names_ = [
soup.select_one("p > strong").text.strip(),
soup.find(text=re.compile("Additional Contact:")).replace('Additional Contact:', '').strip()
]
email_ = [i.strip() for i in soup.find_all(text=re.compile("absol"))]
print(" ".join(i " " j for i, j in zip(names_, email_)))
Robert Romanoff [email protected] William Locantro [email protected]
CodePudding user response:
For more complex html/xml parsing you should take a look at xpath
which allows very powerful selector rules.
In python it's available in parsel
package.
from parsel import Selector
html = '...'
sel = Selector(html)
name = sel.xpath('//strong[1]/text()').get().strip()
email = sel.xpath("//text()[re:test(., 'Email')]/following-sibling::a/text()").get().strip()
name_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]").re("Additional Contact: (. )")[0]
email_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]/following-sibling::a/text()").get().strip()
print(name, email, name_additional, email_additional)
# Robert Romanoff [email protected] William Locantro [email protected]