How to find elements in soup by Tag and specific attribute

Time:11-09

I have an HTML file structured as below, with hundreds of such elements inside the main tag:

<main>
    <div id="rows">
        <p data-name="First Element">
            <a target="_blank" href="localhost">
                <strong>First Element</strong>
            </a>
            <strong  data-date="2016-06-27">6 years</strong>
        </p>
        <p data-name="Second Element">
            <a target="_blank" href="localhost">
                <strong>Second Element</strong>
            </a>
            <strong  data-date="2016-06-27">6 years</strong>
        </p>
    </div>    
</main>

  1. First, I want to find all <p> elements that have the data-name attribute.
  2. Next, for each of those <p> elements, I want to select the <strong> tag that has the data-date attribute.
  3. Finally, I want to extract the href of each link and the values of the selected tags.

Here is my code so far:

from bs4 import BeautifulSoup


with open('index.html', 'r') as file:
    file_text = file.read()

soup = BeautifulSoup(file_text, "html.parser")

results = soup.find("main").find(id="rows")

CodePudding user response:

You can get all the <p> tags first. Then, for each one, check whether it has the data-name attribute and whether any of its <strong> children has the data-date attribute. If both are true, extract the data:

from bs4 import BeautifulSoup

with open('index.html', 'r') as file:
    file_text = file.read()

soup = BeautifulSoup(file_text, "html.parser")

results = soup.find("main").find(id="rows")

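# find_all is the current spelling of the legacy findAll alias
ps = results.find_all('p')
for p in ps:
    if p.has_attr('data-name'):
        strongs = p.find_all('strong')
        for strong in strongs:
            if strong.has_attr('data-date'):
                # print every link in this <p> together with the date
                a_els = p.find_all('a')
                for a in a_els:
                    print(a['href'])
                    print(strong['data-date'])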
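
The has_attr checks above can also be folded into the search itself. The following is only a sketch, assuming the same markup as in the question (trimmed here, with one extra <p> that lacks data-name added for illustration); it relies on Beautiful Soup treating an attribute value of True as "the attribute is present, with any value". The rows list is a name chosen for this example.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question; the second <p> is an
# illustrative addition that should be skipped.
html = """
<div id="rows">
    <p data-name="First Element">
        <a target="_blank" href="localhost"><strong>First Element</strong></a>
        <strong data-date="2016-06-27">6 years</strong>
    </p>
    <p>
        <a href="ignored"><strong>No data-name here</strong></a>
    </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# attrs={"data-name": True} matches any <p> that has the attribute at all
for p in soup.find_all("p", attrs={"data-name": True}):
    # find() returns the first matching tag, or None if there is none
    dated = p.find("strong", attrs={"data-date": True})
    link = p.find("a", href=True)
    if dated is None or link is None:
        continue
    rows.append((link["href"], dated["data-date"], dated.get_text(strip=True)))

print(rows)
```

Compared with the nested loops, this keeps one level of iteration and skips any <p> that is missing either the dated <strong> or a link.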

CodePudding user response:

You could also use CSS selectors to target the elements more specifically, which simplifies the logic:

soup.select('main #rows p[data-name]:has(strong[data-date])')

Example

from bs4 import BeautifulSoup

html='''
<main>
    <div id="rows">
        <p data-name="First Element">
            <a target="_blank" href="localhost">
                <strong>First Element</strong>
            </a>
            <strong  data-date="2016-06-27">6 years</strong>
        </p>
        <p data-name="Second Element">
            <a target="_blank" href="localhost">
                <strong>Second Element</strong>
            </a>
            <strong  data-date="2016-06-27">6 years</strong>
        </p>
    </div>
</main>
'''
soup = BeautifulSoup(html, 'html.parser')
data = []

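for e in soup.select('p[data-name]:has(strong[data-date])'):
    # e.strong would return the first <strong>, which is the one inside <a>,
    # so select the <strong> carrying data-date explicitly
    dated = e.select_one('strong[data-date]')
    data.append({
        'link':e.a.get('href'),
        'linkText':e.a.get_text(strip=True),
        'date':dated.get('data-date'),
        'strongText': dated.get_text(strip=True)
    })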
data

Output

[{'link': 'localhost',
  'linkText': 'First Element',
  'date': '2016-06-27',
  'strongText': '6 years'},
 {'link': 'localhost',
  'linkText': 'Second Element',
  'date': '2016-06-27',
  'strongText': '6 years'}]
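
The example above assumes every matched <p> contains an <a>. With hundreds of elements that may not always hold, so here is a more defensive sketch. The helper name extract_rows, the returned keys, and the sample string are hypothetical and chosen only for illustration; select and select_one are the regular Beautiful Soup CSS selector methods.

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Hypothetical helper: collect name, link and date from matching <p> tags."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.select("p[data-name]"):
        # select_one returns None when nothing matches
        dated = p.select_one("strong[data-date]")
        link = p.select_one("a[href]")
        if dated is None:
            continue  # no dated <strong>, so this <p> is not of interest
        rows.append({
            "name": p["data-name"],
            "link": link["href"] if link else None,  # tolerate a missing link
            "date": dated["data-date"],
        })
    return rows


# Illustrative input: "A" has a date but no link, "B" has a link but no date
sample = (
    '<p data-name="A"><strong data-date="2016-06-27">6 years</strong></p>'
    '<p data-name="B"><a href="localhost"><strong>B</strong></a></p>'
)
print(extract_rows(sample))
```

Here "A" is kept with link set to None, while "B" is dropped because it has no <strong> with data-date.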