I have HTML content as given below:
content = '''<p >
Sector:
<a href="/company/compare/00000008/">
Capital Goods - Electrical Equipment
</a>
<span style="margin: 16px"></span>
Industry:
<a href="/company/compare/00000008/00000039/">
Electric Equipment
</a>
</p>'''
I want to extract Sector = Capital Goods - Electrical Equipment
and Industry = Electric Equipment using BeautifulSoup.
Kindly guide me on how to do this.
CodePudding user response:
To get the texts into a structured format such as a dict of key/value pairs, you can pass a list comprehension to dict():
dict([(x.previous.strip()[:-1],x.get_text(strip=True)) for x in soup.select('p.sub a')])
This selects all the <a> elements in your example, iterates over the ResultSet to get the values, and also extracts the associated key for each one from the text just before it. The result:
{'Sector': 'Capital Goods - Electrical Equipment', 'Industry': 'Electric Equipment'}
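If the one-liner feels dense, the same steps can be unrolled into a plain loop. This is only a minimal sketch: it assumes the <p> carries class="sub" (implied by the 'p.sub a' selector, although the posted HTML does not show it), and it uses the documented previous_sibling attribute rather than previous. Both point at the label text here because each label sits directly before its link.

```python
from bs4 import BeautifulSoup

# Sample markup; class="sub" is an assumption inferred from the selector
html = '''
<p class="sub">
Sector:
<a href="/company/compare/00000008/">
Capital Goods - Electrical Equipment
</a>
<span style="margin: 16px"></span>
Industry:
<a href="/company/compare/00000008/00000039/">
Electric Equipment
</a>
</p>
'''

soup = BeautifulSoup(html, 'html.parser')

result = {}
for link in soup.select('p.sub a'):
    # The text node right before the link holds the label, e.g. "\nSector:\n"
    label = link.previous_sibling.strip()
    # Drop the trailing colon to get the key
    key = label[:-1]
    # get_text(strip=True) trims the newlines around the link text
    result[key] = link.get_text(strip=True)

print(result)
```

Each pass through the loop handles one link, so it is easy to add a print() inside it and see which label is paired with which value.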
Example
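from bs4 import BeautifulSoup

# class="sub" on <p> is assumed here so that the 'p.sub a' selector matches
html = '''
<p class="sub">
Sector:
<a href="/company/compare/00000008/">
Capital Goods - Electrical Equipment
</a>
<span style="margin: 16px">
</span>
Industry:
<a href="/company/compare/00000008/00000039/">
Electric Equipment
</a>
</p>
'''

# Build the soup only after html is defined; naming the parser avoids a bs4 warning
soup = BeautifulSoup(html, 'html.parser')

# Key: the text just before each <a>, minus its trailing colon; value: the link text
dict([(x.previous.strip()[:-1],x.get_text(strip=True)) for x in soup.select('p.sub a')])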
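In case the <p> in your real page has no class at all, as in the HTML posted in the question, a variant that does not depend on the selector might look like the sketch below. The helper name parse_labelled_links is hypothetical, and the approach assumes every label is a plain text node placed directly before its link; links without such a label are skipped.

```python
from bs4 import BeautifulSoup, NavigableString


def parse_labelled_links(html):
    """Map each 'Label:' text node to the text of the <a> that follows it."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = {}
    for link in soup.find_all('a'):
        label = link.previous_sibling
        # Skip links that are not directly preceded by non-empty text
        if not isinstance(label, NavigableString) or not label.strip():
            continue
        # rstrip(':') removes the colon only when one is actually present
        pairs[label.strip().rstrip(':')] = link.get_text(strip=True)
    return pairs


# Markup as posted in the question, with no class on <p>
html = '''<p >
Sector:
<a href="/company/compare/00000008/">
Capital Goods - Electrical Equipment
</a>
<span style="margin: 16px"></span>
Industry:
<a href="/company/compare/00000008/00000039/">
Electric Equipment
</a>
</p>'''

info = parse_labelled_links(html)
print(info['Sector'])
print(info['Industry'])
```

Because find_all('a') looks at every link in the document, on a full page you would probably want to narrow the search to the relevant <p> first.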