I'm using ElementTree to extract data from HTML in a format that has evolved in structure over time (see samples below).
I'm currently doing this by using iterfind to find different matching blocks of structure (font/b, b/font, font).
But, I've noticed there is a general pattern. Regardless of the specific set of HTML elements in use, the ultimate inner text of the first div child is the color, the second child is the pet-type, and the third child is the name.
Is there a generic way of doing this via ElementTree? That would make my code simpler, and possibly more future-proof.
<div>
<font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
<b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
<font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div>
CodePudding user response:
How about something like this:
import xml.etree.ElementTree as ET

pets = """<body><div>
<font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
<b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
<font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div></body>"""
animals = []
doc = ET.fromstring(pets)
for pet in doc.findall('.//div'):
    # keep the text of every descendant element that has any
    animals.append([animal.text for animal in pet.findall('.//*') if animal.text])
animals  # in a REPL or notebook this displays the result
Output:
[['Brown', 'Cat', 'Larry'],
['White', 'Poodle', 'Foxy'],
['Tabby', 'Cat', 'Tempi']]
CodePudding user response:
This code appears to work:
# div is a single <div> element from the parsed document
items = div.itertext()
textblocks = []
for item in items:
    trimmed = item.strip()
    # skip whitespace-only text between tags
    if len(trimmed) > 0:
        textblocks.append(trimmed)
color = textblocks[0]
pet_type = textblocks[1]
name = textblocks[2]
print(color + ', ' + pet_type + ', ' + name)
I welcome any improvements to the code.
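One possible tightening, as a rough sketch rather than a definitive answer: since the question describes the pattern in terms of the div's direct children (first child is the color, second the pet type, third the name), the code can iterate those children and join each child's itertext(). The names SAMPLE and parse_pet are made up for this illustration, and the sketch assumes the markup is well-formed enough for ET.fromstring to parse and that every div has at least three children.

```python
import xml.etree.ElementTree as ET

# SAMPLE is a hypothetical wrapper around the three divs from the question.
SAMPLE = """<body>
<div>
<font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
<b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
<font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div>
</body>"""


def parse_pet(div):
    """Hypothetical helper: return (color, pet_type, name) for one div."""
    # Iterating an Element yields only its direct children.
    # itertext() gathers all text nested inside a child, whatever tags wrap it.
    fields = ["".join(child.itertext()).strip() for child in div]
    # Assumes at least three children; raises ValueError otherwise.
    color, pet_type, name = fields[:3]
    return color, pet_type, name


root = ET.fromstring(SAMPLE)
pets = [parse_pet(div) for div in root.iter("div")]
for color, pet_type, name in pets:
    print(f"{color}, {pet_type}, {name}")
```

Compared with collecting every non-empty text block in the div, this ties each value to its position among the div's children, so a child that happens to contain two text fragments would still count as a single field.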