Beautiful Soup only get deepest unique descendants from tag

Time:10-22

I am trying to get a list of the main content from web articles. I can get all descendants, but I only need the deepest children. My current code collects every descendant, but then I have to filter the list afterwards.

    from bs4 import BeautifulSoup, NavigableString, Tag

    # html = BeautifulSoup(r.text, "html.parser")
    def parse_content(html):
      content_list = []
      main_content = html.find("div", class_ = "pf-content") # change class based on the website

      for i in main_content.descendants:
        content_list.append(i)

      """ 
      # .descendants recursively iterates over all children, so a tag like
      # <em>italicized word</em> is appended both as <em>italicized word</em>
      # and "italicized word". The below code removes the string without the tag.
      final_list = content_list
      for c, i in enumerate(content_list):
        is_current_navigable_string = isinstance(i, NavigableString)
        is_previous_tag = isinstance(content_list[c-1], Tag)
        correct_types = is_current_navigable_string and is_previous_tag

        if correct_types and i == str(content_list[c-1].text):
          final_list.remove(i)
      """

      return content_list

Answer:

If you really just wanted the deepest descendants, you could simply skip anything that has descendants of its own (because, by definition, a deepest descendant has no children), with something like

        for i in main_content.descendants:
            if i.name is None or list(i.children) == []:
                content_list.append(i)

but your comment also says "The below code removes the string without the tag", so I'm assuming that you want the tag itself when all it contains is the deepest-descendant string, and the actual deepest descendants everywhere else.

I'm fond of list comprehensions, so my suggested method is:

    return [
        d for d in main_content.descendants if
        # strings whose parent also contains at least one tag
        (isinstance(d, NavigableString) and d.parent.find() is not None) or
        # tags that contain no other tags
        (isinstance(d, Tag) and d.find() is None)
    ]

So if main_content contained

    <h1>Header</h1>
    <div><b>Bold text</b><i id="empty_ital"></i> other text</div>
    <p>Some more <span>things</span></p> free text.

then, assuming the markup has no line breaks between the elements (otherwise the whitespace-only strings between them would be included too),

    [<h1>Header</h1>, <b>Bold text</b>, <i id="empty_ital"></i>, ' other text', 'Some more ', <span>things</span>, ' free text.']

would be returned.
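
To compare the two approaches side by side, here is a minimal, self-contained sketch that runs both filters on the sample markup. The names `SAMPLE`, `has_nested_tag`, `leaves_only` and `deepest_unique` are made up for this illustration, the wrapper `div` reuses the `pf-content` class from the question, and `find(True)` is used as an explicit "match any tag" spelling of the bare `find()` call.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup from the example, written without line breaks so that no
# whitespace-only strings appear between the elements.
SAMPLE = (
    '<div class="pf-content">'
    '<h1>Header</h1>'
    '<div><b>Bold text</b><i id="empty_ital"></i> other text</div>'
    '<p>Some more <span>things</span></p> free text.'
    '</div>'
)


def has_nested_tag(tag):
    """True if the tag contains at least one other tag at any depth."""
    return tag.find(True) is not None


def leaves_only(container):
    """Strict version: keep every string, plus tags that have no children."""
    return [
        d for d in container.descendants
        if isinstance(d, NavigableString)
        or (isinstance(d, Tag) and not d.contents)
    ]


def deepest_unique(container):
    """Keep tags without nested tags, plus strings whose parent holds a tag."""
    return [
        d for d in container.descendants
        if (isinstance(d, NavigableString) and has_nested_tag(d.parent))
        or (isinstance(d, Tag) and not has_nested_tag(d))
    ]


soup = BeautifulSoup(SAMPLE, "html.parser")
main_content = soup.find("div", class_="pf-content")

# Convert to plain strings so the two results are easy to compare.
strict = [str(d) for d in leaves_only(main_content)]
merged = [str(d) for d in deepest_unique(main_content)]

print(strict)
print(merged)
```

The strict version returns the bare strings ('Header', 'Bold text', 'things') without their tags, while the second version returns the enclosing tag whenever a string is the only thing inside it.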
