I have an HTML page structured roughly as follows. I want to extract the data under each heading, together with the heading itself. I can't simply call find_all on the li tags, because then I lose track of which li tags belong to which heading. I am new to scraping, so I'm not sure how this works. I want output like this: the first heading followed by the text of all the li tags from that heading up to the next one, then the next heading and its data.
<div class = "some div class">
<div class = "some other div">
<h3> first title </h3>
<ul>
<li>Square No.1477: Rodman Street. NW And Fordham Street NW</li>
<li>Square No.1586: Davenport Street. NW And 44th Street NW</li>
<li>Square No.1738: Garrison Street. NW And 41St Street NW</li>
<li>Square No.2997:Ingraham Street. NW And Georgia Avenue NW</li>
<li>Square No.3145: Illinois Avenue NW, And Decatur Street NW</li>
<li>Square No.3292: Madison Street. NW And 3Rd Place NW</li>
<li>Square No.3337: 2nd Street. NW And Oglethorpe Street NW</li>
<li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
<li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
</ul>
<h3>Second title</h3>
<ul>
<li>Square No.177: Fordham Street NW</li>
<li>Square No.186: 44th Street NW</li>
<li>Square No.138: 41St Street NW</li>
<li>Square No.997:Ingraham Georgia Avenue NW</li>
<li>Square No.314: Decatur Street NW</li>
<li>Square No.3292: Madison Street. NW And 3Rd Place NW</li>
<li>Square No.333: Oglethorpe Street NW</li>
<li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
<li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
</ul>
</div>
</div>
Answer:
If you want to group the <li> tags under the <h3> tag that precedes them, you can use .find_previous:
from bs4 import BeautifulSoup
html = """\
<div class = "some div class">
<div class = "some other div">
<h3> first title </h3>
<ul>
<li>Square No.1477: Rodman Street. NW And Fordham Street NW</li>
<li>Square No.1586: Davenport Street. NW And 44th Street NW</li>
<li>Square No.1738: Garrison Street. NW And 41St Street NW</li>
<li>Square No.2997:Ingraham Street. NW And Georgia Avenue NW</li>
<li>Square No.3145: Illinois Avenue NW, And Decatur Street NW</li>
<li>Square No.3292: Madison Street. NW And 3Rd Place NW</li>
<li>Square No.3337: 2nd Street. NW And Oglethorpe Street NW</li>
<li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
<li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
</ul>
<h3>Second title</h3>
<ul>
<li>Square No.177: Fordham Street NW</li>
<li>Square No.186: 44th Street NW</li>
<li>Square No.138: 41St Street NW</li>
<li>Square No.997:Ingraham Georgia Avenue NW</li>
<li>Square No.314: Decatur Street NW</li>
<li>Square No.3292: Madison Street. NW And 3Rd Place NW</li>
<li>Square No.333: Oglethorpe Street NW</li>
<li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
<li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
</ul>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
out = {}
for li in soup.select("li"):
    # The nearest preceding <h3> is the heading this <li> belongs to
    heading = li.find_previous("h3").text.strip()
    out.setdefault(heading, []).append(li.text)
print(out)
Prints (formatted here for readability):
{
"first title": [
"Square No.1477: Rodman Street. NW And Fordham Street NW",
"Square No.1586: Davenport Street. NW And 44th Street NW",
"Square No.1738: Garrison Street. NW And 41St Street NW",
"Square No.2997:Ingraham Street. NW And Georgia Avenue NW",
"Square No.3145: Illinois Avenue NW, And Decatur Street NW",
"Square No.3292: Madison Street. NW And 3Rd Place NW",
"Square No.3337: 2nd Street. NW And Oglethorpe Street NW",
"Square No.3337: Peabody Street. NW And 2nd Place NW",
"Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW",
],
"Second title": [
"Square No.177: Fordham Street NW",
"Square No.186: 44th Street NW",
"Square No.138: 41St Street NW",
"Square No.997:Ingraham Georgia Avenue NW",
"Square No.314: Decatur Street NW",
"Square No.3292: Madison Street. NW And 3Rd Place NW",
"Square No.333: Oglethorpe Street NW",
"Square No.3337: Peabody Street. NW And 2nd Place NW",
"Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW",
],
}
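As an alternative, you could also go the other way round: loop over the headings and take the list that follows each one. The sketch below assumes every <h3> is directly followed by a sibling <ul>, as in the question's HTML; the shortened html string and the group_by_heading helper are only for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same structure as the page in the question
html = """
<div class="some div class">
<div class="some other div">
<h3> first title </h3>
<ul>
<li>Square No.1477: Rodman Street. NW And Fordham Street NW</li>
<li>Square No.1586: Davenport Street. NW And 44th Street NW</li>
</ul>
<h3>Second title</h3>
<ul>
<li>Square No.177: Fordham Street NW</li>
</ul>
</div>
</div>
"""


def group_by_heading(markup):
    """Map each <h3> text to the <li> texts of the next sibling <ul>."""
    soup = BeautifulSoup(markup, "html.parser")
    grouped = {}
    for h3 in soup.find_all("h3"):
        # Next <ul> on the same nesting level as the heading (None if absent)
        ul = h3.find_next_sibling("ul")
        items = ul.find_all("li") if ul is not None else []
        grouped[h3.get_text(strip=True)] = [li.get_text(strip=True) for li in items]
    return grouped


result = group_by_heading(html)

# Print each heading followed by its items
for heading, items in result.items():
    print(heading)
    for item in items:
        print("  -", item)
```

Note that find_next_sibling("ul") skips over any siblings in between, so a heading that has no list of its own would pick up the next heading's list. If that can happen on your page, the .find_previous approach above is the safer choice.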