I have the content below and I am trying to understand how to extract the <p> tag copy using Beautiful Soup (I am open to other methods). As you can see, the <p> tags are not both nested inside the same <div>. I gave it a shot, but my attempt only seems to work when both <p> tags are within the same container.
<div class="top-panel">
<div class="inside-panel-0">
<h1 class="h1-title">Some Title</h1>
</div>
<div class="inside-panel-0">
<div class="inside-panel-1">
<p> I want to extract this copy</p>
</div>
<div class="inside-panel-1">
<p>I want to extract this copy</p>
</div>
</div>
</div>
CodePudding user response:
IIUC, try a CSS selector that matches every <p> sitting directly inside a div with the class inside-panel-1:
from bs4 import BeautifulSoup
html = """<div class="top-panel">
<div class="inside-panel-0">
<h1 class="h1-title">Some Title</h1>
</div>
<div class="inside-panel-0">
<div class="inside-panel-1">
<p> I want to extract this copy</p>
</div>
<div class="inside-panel-1">
<p>I want to extract this copy</p>
</div>
</div>
</div>"""
soup = BeautifulSoup(html, 'lxml')
# find all the p tags whose parent div has the class inside-panel-1
soup.select('div.inside-panel-1 > p')
# -> [<p> I want to extract this copy</p>, <p>I want to extract this copy</p>]
If you want just the text then try
p_tags = soup.select('div.inside-panel-1 > p')
[elm.text for elm in p_tags]
# -> [' I want to extract this copy', 'I want to extract this copy']
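If the wrapper class names ever differ between the two paragraphs, a variation is to start from the outer container and collect every <p> below it, however deeply it is nested. This is only a minimal sketch: the helper name extract_paragraphs and its container_class parameter are hypothetical, and it assumes the top-panel wrapper from the question.

```python
from bs4 import BeautifulSoup

HTML = """<div class="top-panel">
<div class="inside-panel-0">
<h1 class="h1-title">Some Title</h1>
</div>
<div class="inside-panel-0">
<div class="inside-panel-1">
<p> I want to extract this copy</p>
</div>
<div class="inside-panel-1">
<p>I want to extract this copy</p>
</div>
</div>
</div>"""


def extract_paragraphs(html, container_class="top-panel"):
    """Return the stripped text of every <p> nested anywhere inside the container div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    if container is None:
        # No matching container in this document: nothing to extract.
        return []
    # find_all searches all descendants by default, so the <p> tags
    # do not need to share the same parent div.
    return [p.get_text(strip=True) for p in container.find_all("p")]


print(extract_paragraphs(HTML))
# -> ['I want to extract this copy', 'I want to extract this copy']
```

Because find_all walks every descendant, this does not care which inner div each <p> lives in, which is the situation described in the question.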
CodePudding user response:
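As the p tags are inside div elements with the class inside-panel-1, you can reach them with a CSS selector:
from bs4 import BeautifulSoup
html = """
<div class="top-panel">
  <div class="inside-panel-0">
    <h1 class="h1-title">
      Some Title
    </h1>
  </div>
  <div class="inside-panel-0">
    <div class="inside-panel-1">
      <p>
        I want to extract this copy
      </p>
    </div>
    <div class="inside-panel-1">
      <p>
        I want to extract this copy
      </p>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# print(soup.prettify())
# select every p inside an inside-panel-1 div, anywhere under top-panel
p_tags = soup.select('div.top-panel div.inside-panel-1 p')
for p_tag in p_tags:
    print(p_tag.get_text(strip=True))
Output:
I want to extract this copy
I want to extract this copy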
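Since the question is open to other methods, here is a rough, dependency-free sketch using the standard library's html.parser instead of Beautiful Soup. The names ParagraphCollector and collect_paragraphs are hypothetical, and the sketch assumes the <p> tags are well formed (each one is closed), as in the question's HTML.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, regardless of its parent."""

    def __init__(self):
        super().__init__()
        self._in_p = False      # True while the parser is inside a <p>
        self._chunks = []       # text fragments of the current <p>
        self.paragraphs = []    # finished paragraph texts

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.paragraphs.append("".join(self._chunks).strip())

    def handle_data(self, data):
        # Only keep text that appears between <p> and </p>.
        if self._in_p:
            self._chunks.append(data)


def collect_paragraphs(html):
    """Parse the HTML string and return the stripped text of each <p>."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = """<div class="top-panel">
<div class="inside-panel-0"><h1 class="h1-title">Some Title</h1></div>
<div class="inside-panel-0">
<div class="inside-panel-1"><p> I want to extract this copy</p></div>
<div class="inside-panel-1"><p>I want to extract this copy</p></div>
</div>
</div>"""

print(collect_paragraphs(sample))
# -> ['I want to extract this copy', 'I want to extract this copy']
```

This ignores class names entirely and just picks up every <p>, so for anything more selective the Beautiful Soup selectors above are likely the more convenient route.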