How to parse multiline attributes using BeautifulSoup

I have HTML content as given below:

content = '''<p class="sub">
        Sector:
        <a href="/company/compare/00000008/">
          Capital Goods - Electrical Equipment
        </a>
        <span style="margin: 16px"></span>
        Industry:
        <a href="/company/compare/00000008/00000039/">
          Electric Equipment
        </a>
</p>'''

I want to extract Sector = Capital Goods - Electrical Equipment and Industry = Electric Equipment using BeautifulSoup. Please guide me on how to do this.

CodePudding user response：

To get the texts into a structured format such as a dict of key/value pairs, you can pass a list comprehension to dict():

dict([(x.previous_element.strip()[:-1], x.get_text(strip=True)) for x in soup.select('p.sub a')])

This selects every <a> in your example, iterates over the ResultSet to collect the values, and takes the text just before each link (minus the trailing colon) as the associated key. The result is:

{'Sector': 'Capital Goods - Electrical Equipment', 'Industry': 'Electric Equipment'}

Example

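from bs4 import BeautifulSoup

html = '''
<p class="sub">
   Sector:
   <a href="/company/compare/00000008/">
    Capital Goods - Electrical Equipment
   </a>
   <span style="margin: 16px">
   </span>
   Industry:
   <a href="/company/compare/00000008/00000039/">
    Electric Equipment
   </a>
</p>
'''

# Build the soup only after html is defined, and name the parser explicitly.
soup = BeautifulSoup(html, 'html.parser')

# Key: the text node right before each <a>, minus the trailing colon.
# Value: the link text with surrounding whitespace stripped.
data = dict([(x.previous_element.strip()[:-1], x.get_text(strip=True)) for x in soup.select('p.sub a')])
print(data)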
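If the one-liner is hard to follow, the same idea can be written as an explicit loop. This is only a sketch under some assumptions: the <p> carries class "sub" (as the 'p.sub a' selector implies), every link is preceded by a "Label:" text node, and parse_labels is a hypothetical helper name, not part of BeautifulSoup. It uses find_previous with a string filter so that whitespace-only text nodes between the label and the link are skipped.

```python
from bs4 import BeautifulSoup

html = '''
<p class="sub">
   Sector:
   <a href="/company/compare/00000008/">
    Capital Goods - Electrical Equipment
   </a>
   <span style="margin: 16px">
   </span>
   Industry:
   <a href="/company/compare/00000008/00000039/">
    Electric Equipment
   </a>
</p>
'''


def parse_labels(markup):
    """Map each label (without its trailing colon) to the text of the link after it.

    Hypothetical helper for illustration; assumes the <p class="sub"> layout above.
    """
    soup = BeautifulSoup(markup, 'html.parser')
    result = {}
    for link in soup.select('p.sub a'):
        # Nearest non-blank text node that precedes the link in document order.
        label = link.find_previous(string=lambda s: bool(s and s.strip()))
        if label is None:
            continue  # no label found for this link; skip it
        result[label.strip().rstrip(':')] = link.get_text(strip=True)
    return result


print(parse_labels(html))
```

Note that if a link had no label of its own, find_previous would fall back to whatever non-blank text comes earlier in the document, so on messier pages you may want to check that the label actually ends with a colon before using it.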