I am trying to get a list of the main content from web articles. I can get all descendants, but I only need the deepest children. My current code gets the descendants, but then I have to sort my list.
# html = BeautifulSoup(r.text, "html.parser")
def parse_content(html):
    content_list = []
    main_content = html.find("div", class_="pf-content")  # change class based on the website
    for i in main_content.descendants:
        content_list.append(i)
    """
    # .descendants recursively iterates over all children, so a tag like
    # <em>italicized word</em> is appended both as <em>italicized word</em>
    # and "italicized word". The below code removes the string without the tag.
    final_list = content_list
    for c, i in enumerate(content_list):
        is_current_navigable_string = isinstance(i, NavigableString)
        is_previous_tag = isinstance(content_list[c-1], Tag)
        correct_types = is_current_navigable_string and is_previous_tag
        if correct_types and i == str(content_list[c-1].text):
            final_list.remove(i)
    """
    return content_list
CodePudding user response:
If you really just wanted the deepest descendants, you could simply skip anything that has descendants of its own
(by definition, a deepest descendant has no children), something like
for i in main_content.descendants:
    if i.name is None or list(i.children) == []:
        content_list.append(i)
But you also wrote
"The below code removes the string without the tag"
so I'm assuming that when a tag contains nothing but a string you want the tag itself rather than the string, and everywhere else you want the actual deepest descendants.
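For reference, the strict "leaves only" reading can be sketched as a small self-contained script. This is only an illustration: it assumes the bs4 package is installed, the helper name leaf_nodes and the sample markup are made up for the example, and it uses isinstance and .contents to express the same test as the loop above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def leaf_nodes(root):
    """Collect nodes that have nothing below them (hypothetical helper)."""
    return [
        node for node in root.descendants
        # Strings are always leaves; a tag is a leaf only if it is empty.
        if isinstance(node, NavigableString) or not node.contents
    ]


# Made-up sample markup, written on one line to avoid whitespace-only strings.
soup = BeautifulSoup("<div><b>Bold</b><i></i> tail</div>", "html.parser")
leaves = [str(node) for node in leaf_nodes(soup.div)]
print(leaves)
```

Note that <b>Bold</b> itself is dropped here and only its string survives, which is why this reading probably isn't what you are after.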
I'm fond of list comprehensions, so my suggested method is:
return [
    d for d in main_content.descendants
    if (isinstance(d, NavigableString) and d.parent.find() is not None)
    or (isinstance(d, Tag) and d.find() is None)
]
So if main_content was
<h1>Header</h1>
<div><b>Bold text</b><i id="empty_ital"></i> other text</div>
<p>Some more <span>things</span></p> free text.
then
[<h1>Header</h1>, <b>Bold text</b>, <i id="empty_ital"></i>, ' other text', 'Some more ', <span>things</span>, ' free text.']
would be returned.
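To try this end to end, here is a minimal sketch under a few assumptions: bs4 is installed, the helper name deepest_nodes is made up for the example, and the sample markup is wrapped in the pf-content div from the question. It also spells the "any tag" test as find(True) instead of a bare find(), which is a small substitution on my part to make the intent explicit.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def deepest_nodes(main_content):
    """Hypothetical helper: tags without nested tags, plus strings that sit beside tags."""
    return [
        d for d in main_content.descendants
        # Keep a string only when its parent also holds at least one tag.
        if (isinstance(d, NavigableString) and d.parent.find(True) is not None)
        # Keep a tag only when no other tag is nested inside it.
        or (isinstance(d, Tag) and d.find(True) is None)
    ]


# Sample markup from above, on one line so no whitespace-only strings appear.
sample = (
    '<div class="pf-content">'
    '<h1>Header</h1>'
    '<div><b>Bold text</b><i id="empty_ital"></i> other text</div>'
    '<p>Some more <span>things</span></p> free text.'
    '</div>'
)

soup = BeautifulSoup(sample, "html.parser")
main_content = soup.find("div", class_="pf-content")
result = [str(node) for node in deepest_nodes(main_content)]
print(result)
```

With the one-line markup this should give the same seven items as the list above. If the real page has newlines between tags, those newline strings are direct children of main_content and will be kept too, so you may want to drop entries where d.strip() is empty.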