Extract data between heading tags along with the heading

Time:08-24

I have an HTML page like the one below. I want to extract the data under each heading, together with the heading itself. I can't simply call find_all on the li tags, because I would lose track of which li tags belong to which heading. I am new to scraping, so I am not sure how this works. The output I want is the first heading followed by the text of every li tag from that heading up to the next one, then the next heading and its data, and so on.

<div class = "some div class">
   <div class = "some other div">
     <h3> first title </h3>
    <ul>
        <li>Square No.1477:  Rodman Street. NW And Fordham Street NW</li>
        <li>Square No.1586:  Davenport Street. NW And 44th Street NW</li>
        <li>Square No.1738:  Garrison Street. NW And 41St Street NW</li>
        <li>Square No.2997:Ingraham Street. NW And Georgia Avenue NW</li>
        <li>Square No.3145: Illinois Avenue NW, And Decatur Street NW</li>
        <li>Square No.3292:  Madison Street. NW And 3Rd Place NW</li>
        <li>Square No.3337:  2nd Street. NW And Oglethorpe Street NW</li>
        <li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
        <li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
    </ul>
   <h3>Second title</h3>
    <ul>
        <li>Square No.177:  Fordham Street NW</li>
        <li>Square No.186:  44th Street NW</li>
        <li>Square No.138:  41St Street NW</li>
        <li>Square No.997:Ingraham Georgia Avenue NW</li>
        <li>Square No.314: Decatur Street NW</li>
        <li>Square No.3292:  Madison Street. NW And 3Rd Place NW</li>
        <li>Square No.333:  Oglethorpe Street NW</li>
        <li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
        <li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
    </ul>
   </div>
</div>

CodePudding user response:

To group each <li> tag under the <h3> heading that precedes it, you can use .find_previous:

from bs4 import BeautifulSoup


html = """\
<div class = "some div class">
   <div class = "some other div">
     <h3> first title </h3>
    <ul>
        <li>Square No.1477:  Rodman Street. NW And Fordham Street NW</li>
        <li>Square No.1586:  Davenport Street. NW And 44th Street NW</li>
        <li>Square No.1738:  Garrison Street. NW And 41St Street NW</li>
        <li>Square No.2997:Ingraham Street. NW And Georgia Avenue NW</li>
        <li>Square No.3145: Illinois Avenue NW, And Decatur Street NW</li>
        <li>Square No.3292:  Madison Street. NW And 3Rd Place NW</li>
        <li>Square No.3337:  2nd Street. NW And Oglethorpe Street NW</li>
        <li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
        <li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
    </ul>
   <h3>Second title</h3>
    <ul>
        <li>Square No.177:  Fordham Street NW</li>
        <li>Square No.186:  44th Street NW</li>
        <li>Square No.138:  41St Street NW</li>
        <li>Square No.997:Ingraham Georgia Avenue NW</li>
        <li>Square No.314: Decatur Street NW</li>
        <li>Square No.3292:  Madison Street. NW And 3Rd Place NW</li>
        <li>Square No.333:  Oglethorpe Street NW</li>
        <li>Square No.3337: Peabody Street. NW And 2nd Place NW</li>
        <li>Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW</li>
    </ul>
   </div>
</div>"""


soup = BeautifulSoup(html, "html.parser")

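# Map each heading's text to the list of <li> texts that follow it.
out = {}
for li in soup.select("li"):
    # The nearest <h3> before this <li> in document order is its heading.
    heading = li.find_previous("h3").text.strip()
    out.setdefault(heading, []).append(li.text)

print(out)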

Prints (output reformatted here for readability):

{
    "first title": [
        "Square No.1477:  Rodman Street. NW And Fordham Street NW",
        "Square No.1586:  Davenport Street. NW And 44th Street NW",
        "Square No.1738:  Garrison Street. NW And 41St Street NW",
        "Square No.2997:Ingraham Street. NW And Georgia Avenue NW",
        "Square No.3145: Illinois Avenue NW, And Decatur Street NW",
        "Square No.3292:  Madison Street. NW And 3Rd Place NW",
        "Square No.3337:  2nd Street. NW And Oglethorpe Street NW",
        "Square No.3337: Peabody Street. NW And 2nd Place NW",
        "Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW",
    ],
    "Second title": [
        "Square No.177:  Fordham Street NW",
        "Square No.186:  44th Street NW",
        "Square No.138:  41St Street NW",
        "Square No.997:Ingraham Georgia Avenue NW",
        "Square No.314: Decatur Street NW",
        "Square No.3292:  Madison Street. NW And 3Rd Place NW",
        "Square No.333:  Oglethorpe Street NW",
        "Square No.3337: Peabody Street. NW And 2nd Place NW",
        "Square No.5441: Livingston Street NW, Nevada Avenue. NW And Legation Street NW",
    ],
}
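
An alternative is to start from the headings instead of the list items: walk the siblings that follow each <h3> and stop at the next <h3>. The sketch below is only an illustration, not part of the original answer. It assumes the <h3> and <ul> tags are siblings inside the same parent, as in the question. The shortened sample markup and the function name group_by_heading are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same shape as the page in the question (assumption).
sample = """
<div>
  <h3> first title </h3>
  <ul>
    <li>Square No.1477:  Rodman Street. NW And Fordham Street NW</li>
    <li>Square No.1586:  Davenport Street. NW And 44th Street NW</li>
  </ul>
  <h3>Second title</h3>
  <ul>
    <li>Square No.177:  Fordham Street NW</li>
  </ul>
</div>
"""


def group_by_heading(html):
    """Return {heading text: [li text, ...]} for every <h3> in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for h3 in soup.find_all("h3"):
        items = []
        # Look only at the tags that follow this heading at the same level.
        for sibling in h3.find_next_siblings():
            if sibling.name == "h3":
                break  # the next heading starts a new group
            if sibling.name == "ul":
                items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))
        grouped[h3.get_text(strip=True)] = items
    return grouped


grouped = group_by_heading(sample)

# Print each heading followed by its list items.
for heading, items in grouped.items():
    print(heading)
    for item in items:
        print("  -", item)
```

With this approach a heading that has no list still appears in the result with an empty list, whereas the .find_previous version only records headings that have at least one <li>.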