Using ElementTree for flexible processing of HTML data

Time:12-10

I'm using ElementTree to extract data from HTML whose structure has evolved over time (see samples below).

I currently do this by calling iterfind several times, once for each structural variant (font/b, b/font, font).

However, I've noticed a general pattern: regardless of which HTML elements are used, the innermost text of the div's first child is the color, that of the second child is the pet type, and that of the third child is the name.

Is there a generic way of doing this with ElementTree? That would simplify my code and possibly make it more future-proof.

<div>
  <font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
  <b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
  <font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div>

CodePudding user response:

How about something like this:

import xml.etree.ElementTree as ET

pets = """<body><div>
  <font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
  <b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
  <font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div></body>"""

animals = []
doc = ET.fromstring(pets)
for pet in doc.findall('.//div'):
    # Keep the text of every descendant element that has any
    animals.append([animal.text for animal in pet.findall('.//*') if animal.text])

animals

Output:

[['Brown', 'Cat', 'Larry'],
 ['White', 'Poodle', 'Foxy'],
 ['Tabby', 'Cat', 'Tempi']]

CodePudding user response:

This code appears to work:

        items = div.itertext()
        textblocks = []
        for item in items:
            trimmed = item.strip()
            if len(trimmed) > 0:
                textblocks.append(trimmed)
        color = textblocks[0]
        pet_type = textblocks[1]
        name = textblocks[2]
        print(color + ', ' + pet_type + ', ' + name)

I welcome any improvements to the code.
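
One possible refinement, offered as a sketch rather than a definitive answer: since the pattern in the question is positional (first, second, and third child of each div), the text can be collected per direct child instead of per text fragment. The helper name parse_pets and the body wrapper are assumptions made for illustration, and the input is assumed to be well-formed enough for ElementTree to parse.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in <body> so it parses as one document.
PETS = """<body><div>
  <font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
  <b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
  <font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div></body>"""


def parse_pets(markup):
    """Return one (color, pet_type, name) tuple per div (hypothetical helper)."""
    rows = []
    for div in ET.fromstring(markup).iter('div'):
        # Iterating an element yields only its direct children; itertext()
        # then gathers all text nested inside each child, whatever the tags.
        fields = ["".join(child.itertext()).strip() for child in div]
        if len(fields) < 3:
            # Skip divs that do not follow the three-child pattern.
            continue
        color, pet_type, name = fields[:3]
        rows.append((color, pet_type, name))
    return rows


for color, pet_type, name in parse_pets(PETS):
    print(color + ', ' + pet_type + ', ' + name)
```

Compared with indexing into a flat list of text fragments, this keeps each field tied to its position under the div, so a child that holds several text fragments still produces a single value. It does assume that divs are not nested inside one another.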
