I have an HTML file structured as below, with hundreds of such elements inside the main tag:
<main>
<div id="rows">
<p data-name="First Element">
<a target="_blank" href="localhost">
<strong>First Element</strong>
</a>
<strong data-date="2016-06-27">6 years</strong>
</p>
<p data-name="Second Element">
<a target="_blank" href="localhost">
<strong>Second Element</strong>
</a>
<strong data-date="2016-06-27">6 years</strong>
</p>
</div>
</main>
- In the first step, I want to find all <p> elements that have the data-name attribute,
- next, for each p element, I want to select the strong tag that has the data-date attribute,
- and last, I want to extract the href of the links and the values of the selected tags.
from bs4 import BeautifulSoup
file = open('index.html', 'r')
file_text = file.read()
soup = BeautifulSoup(file_text, "html.parser")
results = soup.find("main").find(id="rows")
CodePudding user response:
You can get all the <p> tags first, then for every p tag check whether it has the data-name attribute and whether any strong child of that p tag has the data-date attribute. If both are true, just extract the data:
from bs4 import BeautifulSoup

# Read the file and close it automatically
with open('index.html', 'r') as file:
    file_text = file.read()

soup = BeautifulSoup(file_text, "html.parser")
results = soup.find("main").find(id="rows")

for p in results.find_all('p'):
    # Keep only <p> tags that carry a data-name attribute
    if p.has_attr('data-name'):
        for strong in p.find_all('strong'):
            # Keep only <strong> tags that carry a data-date attribute
            if strong.has_attr('data-date'):
                for a in p.find_all('a'):
                    print(a['href'])
                print(strong['data-date'])
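The same filtering can probably be written with fewer nested checks by letting find_all and find match on attribute presence. This is only a sketch: the helper name extract_rows and the trimmed inline HTML are made up for illustration, and it assumes each matching p holds one link and one dated strong tag.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup (an assumption for this sketch); the second <p>
# has no data-name attribute and should be skipped.
html = """
<main>
  <div id="rows">
    <p data-name="First Element">
      <a target="_blank" href="localhost"><strong>First Element</strong></a>
      <strong data-date="2016-06-27">6 years</strong>
    </p>
    <p>
      <strong>No data-name here, so this paragraph is skipped</strong>
    </p>
  </div>
</main>
"""


def extract_rows(soup):
    """Hypothetical helper: collect (href, date, text) for each matching <p>."""
    rows = []
    # attrs={"data-name": True} matches any <p> that has the attribute at all
    for p in soup.find_all("p", attrs={"data-name": True}):
        strong = p.find("strong", attrs={"data-date": True})
        link = p.find("a", href=True)
        if strong is None or link is None:
            continue  # skip paragraphs missing either piece
        rows.append((link["href"], strong["data-date"], strong.get_text(strip=True)))
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = extract_rows(soup)
print(rows)
```

Passing True as the attribute value means "the attribute exists, whatever its value", so no separate has_attr check is needed.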
CodePudding user response:
You could also use CSS selectors to target the elements more specifically, which reaches your goal and simplifies the logic:
soup.select('main #rows p[data-name]:has(strong[data-date])')
Example
from bs4 import BeautifulSoup
html='''
<main>
<div id="rows">
<p data-name="First Element">
<a target="_blank" href="localhost">
<strong>First Element</strong>
</a>
<strong data-date="2016-06-27">6 years</strong>
</p>
<p data-name="Second Element">
<a target="_blank" href="localhost">
<strong>Second Element</strong>
</a>
<strong data-date="2016-06-27">6 years</strong>
</p>
</div>
</main>
'''
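soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.select('p[data-name]:has(strong[data-date])'):
    # e.strong would return the first <strong>, i.e. the one inside the link,
    # so select the one with data-date explicitly
    dated = e.select_one('strong[data-date]')
    data.append({
        'link': e.a.get('href'),
        'linkText': e.a.get_text(strip=True),
        'date': dated.get('data-date'),
        'strongText': dated.get_text(strip=True)
    })

data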
Output
[{'link': 'localhost',
'linkText': 'First Element',
'date': '2016-06-27',
'strongText': '6 years'},
{'link': 'localhost',
'linkText': 'Second Element',
'date': '2016-06-27',
'strongText': '6 years'}]
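One thing to watch with this markup: a tag-name shortcut such as p.strong returns the first strong descendant, which here is the one nested inside the link, not the one carrying data-date. A small sketch of the difference, using a single p from the sample (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<p data-name="First Element">
<a target="_blank" href="localhost">
<strong>First Element</strong>
</a>
<strong data-date="2016-06-27">6 years</strong>
</p>
'''

soup = BeautifulSoup(html, "html.parser")
p = soup.select_one("p[data-name]")

# Shortcut access: first <strong> anywhere inside the <p> (the one in the link)
first_strong = p.strong
# Explicit selector: the <strong> that actually has the data-date attribute
dated_strong = p.select_one("strong[data-date]")

print(first_strong.get_text(strip=True))   # First Element
print(first_strong.get("data-date"))       # None
print(dated_strong.get_text(strip=True))   # 6 years
print(dated_strong.get("data-date"))       # 2016-06-27
```

So to get the date and the "6 years" text, select the strong tag by its attribute rather than relying on the shortcut.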