Python Printing Arranged Output of extracted HTML tags

Time:05-19

I am trying to extract data from the following HTML and organize the extracted output:

html_doc = """
<html>
<body>
            <ul >
              <li >
                <div >Birds Toys</div>
                <div >Toys belonging to the Bird Category</div>
                <ul >
                  <li >
                    <div >
                      <span >Eagle</span>
                      <span >$40.00</span>
                    </div>
                    <p >Eagle is the national bird of the US.</p>
                  </li>
                  <li >
                    <div >
                      <span >Parrot</span>
                      <span >$14.00</span>
                    </div>
                    <p >Parrot is found in tropical and subtropical region.</p>
                  </li>
                  <li >
                    <div >
                      <span >Owls</span>
                      <span >$23.00</span>
                    </div>
                    <p >Owls are nocturnal.</p>
                  </li>
                </ul>
                <ul >
                  <li >
                    <div >
                      <span >Kingfisher</span>
                      <span >$13.00</span>
                    </div>
                    <p >Kigfisher hunts in the water</p>
                  </li>
                  <li >
                    <div >
                      <span >Quail</span>
                      <span >$22.00</span>
                    </div>
                    <p ></p>
                  </li>
                </ul>
              </li>
            </ul>
            <ul >
              <li >
                <div >Reptiles Toys</div>
                <div >Toys belonging to Reptiles Category</div>
                <ul >
                  <li >
                    <div >
                      <span >Snake</span>
                      <span >$7.00</span>
                    </div>
                    <p >Snakes can be poisonous.</p>
                  </li>
                </ul>
                <ul >
                  <li >
                    <div >
                      <span >Lizard</span>
                      <span >$7.00</span>
                    </div>
                    <p >Lizards are found both at homes and in jungle</p>
                  </li>
                </ul>
              </li>
            </ul>
            <ul >
              <li >
                <div >Germs Toys</div>
                <div >Toys that belong to germs category</div>
                <ul >
                  <li >
                    <div >
                      <span >Bacteria</span>
                      <span >$12.95</span>
                    </div>
                    <p >Bacteria can cause tuberclausis</p>
                  </li>
                </ul>
                <ul >
                  <li >
                    <div >
                      <span >Protozoa</span>
                      <span >$11.95</span>
                    </div>
                    <p ></p>
                  </li>
                </ul>
                <ul >
                  <li >
                    <div >
                      <span >Virus</span>
                      <span >$12.95</span>
                    </div>
                    <p >Viruses are known to cause Corona, Aids, etc.</p>
                  </li>
                </ul>
              </li>
            </ul>
</body>
</html>
"""

I am able to successfully extract the div-class, span-class, p-class combinations using the following code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

with open("output.txt", "w") as output:

    # CATEGORY: find all div elements holding a category name
    divitemscatg = soup.find_all('div', {'class' : 'h4 category-name section-title'})
    linesdivitemscatg = [div.get_text() for div in divitemscatg]
    print(linesdivitemscatg)

    # ITEM TITLE: find all span elements holding an item title
    spansitemtitle = soup.find_all('span', {'class' : 'item-title'})
    linesitemtitle = [span.get_text() for span in spansitemtitle]
    print(linesitemtitle)

    # ITEM PRICE: find all span elements holding an item price
    spansitemprice = soup.find_all('span', {'class' : 'item-price'})
    linesitemprice = [span.get_text() for span in spansitemprice]
    print(linesitemprice)

    # DESC: find all p elements holding an item description
    spansitemdesc = soup.find_all('p', {'class' : 'description'})
    linesitemdesc = [p.get_text() for p in spansitemdesc]
    print(linesitemdesc)

The output I am getting is:

['Birds Toys', 'Reptiles Toys', 'Germs Toys']
['Eagle', 'Parrot', 'Owls', 'Kingfisher', 'Quail', 'Snake', 'Lizard', 'Bacteria', 'Protozoa', 'Virus']
['$40.00', '$14.00', '$23.00', '$13.00', '$22.00', '$7.00', '$7.00', '$12.95', '$11.95', '$12.95']
['Eagle is the national bird of the US.', 'Parrot is found in tropical and subtropical region.', 'Owls are nocturnal.', 'Kigfisher hunts in the water', '', 'Snakes can be poisonous.', 'Lizards are found both at homes and in jungle', 'Bacteria can cause tuberclausis', '', 'Viruses are known to cause Corona, Aids, etc.']
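
To illustrate why these flat lists cannot simply be stitched together, here is a minimal sketch that reuses the category, title, and price values printed above. The title and price lists are index-aligned, so `zip` pairs them correctly, but the category list has only three entries for ten items, and nothing in the flat lists records which items sat under which category.

```python
# Values copied from the printed output above.
categories = ['Birds Toys', 'Reptiles Toys', 'Germs Toys']
titles = ['Eagle', 'Parrot', 'Owls', 'Kingfisher', 'Quail',
          'Snake', 'Lizard', 'Bacteria', 'Protozoa', 'Virus']
prices = ['$40.00', '$14.00', '$23.00', '$13.00', '$22.00',
          '$7.00', '$7.00', '$12.95', '$11.95', '$12.95']

# Titles and prices line up one-to-one, so zip pairs them correctly.
pairs = list(zip(titles, prices))
print(pairs[0])  # ('Eagle', '$40.00')

# Only 3 categories exist for 10 items; the parent/child relationship
# from the HTML tree is lost once everything is flattened.
print(len(categories), len(titles))  # 3 10
```

This is why the category has to be looked up relative to each item in the parsed tree rather than from a separate list.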

But I need the output organized differently, as follows:

Birds Toys|Eagle|$40.00|Eagle is the national bird of the US.
Birds Toys|Parrot|$14.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|$23.00|Owls are nocturnal.
Birds Toys|Kingfisher|$13.00|Kigfisher hunts in the water
Birds Toys|Quail|$22.00|
Reptiles Toys|Snake|$7.00|Snakes can be poisonous.
Reptiles Toys|Lizard|$7.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|$12.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|$11.95|
Germs Toys|Virus|$12.95|Viruses are known to cause Corona, Aids, etc.

What changes are needed in the code above to achieve the latter? I am unable to arrange the output in the desired format.

Thanks in advance.

CodePudding user response:

You can achieve this by selecting each element with the menu-items class, finding the category div that precedes it, and prepending that category name to the item's fields:

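from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

with open("output.txt", "w") as output:
    # Each element with the menu-items class holds one item
    for item in soup.select('.menu-items'):
        data = [
            # Nearest preceding category heading (a div whose classes include h4)
            item.find_previous('div', {'class': 'h4'}).text,
            item.select_one('.item-title').text,
            item.select_one('.item-price').text,
            item.select_one('.description').text
        ]
        # Write one pipe-delimited row per item
        output.write('|'.join(data) + '\n')

Output:
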
Birds Toys|Eagle|$40.00|Eagle is the national bird of the US.
Birds Toys|Parrot|$14.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|$23.00|Owls are nocturnal.
Birds Toys|Kingfisher|$13.00|Kigfisher hunts in the water
Birds Toys|Quail|$22.00|
Reptiles Toys|Snake|$7.00|Snakes can be poisonous.
Reptiles Toys|Lizard|$7.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|$12.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|$11.95|
Germs Toys|Virus|$12.95|Viruses are known to cause Corona, Aids, etc.
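
For a version that can be run on its own, here is a minimal sketch of the same `find_previous` approach on a shortened, hypothetical sample. The class names are taken from the selectors used in the code above, but their placement on these particular tags is an assumption. The helper names `text_or_empty` and `extract_rows` are made up for illustration. The second item deliberately has no description element, to show one way of guarding against `select_one` returning `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample. Class names come from the selectors used
# above; which tags carry them is an assumption.
sample = """
<ul>
  <li>
    <div class="h4 category-name section-title">Birds Toys</div>
    <ul>
      <li class="menu-items">
        <div><span class="item-title">Eagle</span><span class="item-price">$40.00</span></div>
        <p class="description">Eagle is the national bird of the US.</p>
      </li>
      <li class="menu-items">
        <div><span class="item-title">Quail</span><span class="item-price">$22.00</span></div>
      </li>
    </ul>
  </li>
</ul>
"""

def text_or_empty(parent, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else ''

def extract_rows(html):
    """Build one pipe-delimited row per element with the menu-items class."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.select('.menu-items'):
        # Walk backwards through the document to the nearest category heading.
        heading = item.find_previous('div', class_='h4')
        rows.append('|'.join([
            heading.get_text(strip=True) if heading is not None else '',
            text_or_empty(item, '.item-title'),
            text_or_empty(item, '.item-price'),
            text_or_empty(item, '.description'),
        ]))
    return rows

rows = extract_rows(sample)
for row in rows:
    print(row)
```

The rows can then be written to a file with `output.write(row + '\n')`, as in the code above. Returning an empty string for a missing element keeps the column count stable, which matters if the file is later parsed by splitting on `|`.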