`.find('li')` on `bs4.element.Tag` object gives None even though `<li>` tag exists i

Time:10-18

I'm trying to parse the content of a URL with BeautifulSoup after requests.get() (not shown in the code). The parser being used is "html.parser". The following snippet is part of a larger script.

print(f"subheading : {subheading}")
print(f"type : {type(subheading)}")
print(f"dir : {dir(subheading)}")
if subheading.find('ul'):
    print(f"Going for next level subheading search")
else:
    c2 = subheading.find("li")
    print(f"c2 : {c2}")

The first print statement gives me this in stdout:

subheading : <li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">PRIN 1.1 Application and purpose</a></li>

I added a type check and an attribute listing to confirm whether I'm doing anything wrong. The second and third print statements give me this:

type : <class 'bs4.element.Tag'>
dir : ['DEFAULT_INTERESTING_STRING_TYPES', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_find_all', '_find_one', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_should_pretty_print', 'append', 'attrs', 'can_be_empty_element', 'cdata_list_attributes', 'childGenerator', 'children', 'clear', 'contents', 'decode', 'decode_contents', 'decompose', 'decomposed', 'default', 'descendants', 'encode', 'encode_contents', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 'get_attribute_list', 'get_text', 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 'interesting_string_types', 'isSelfClosing', 'is_empty_element', 'known_xml', 'name', 'namespace', 'next', 'nextGenerator', 'nextSibling', 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 'parent', 'parentGenerator', 'parents', 'parserClass', 'parser_class', 'prefix', 'preserve_whitespace_tags', 'prettify', 'previous', 'previousGenerator', 'previousSibling', 
'previousSiblingGenerator', 'previous_element', 'previous_elements', 'previous_sibling', 'previous_siblings', 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 'replace_with_children', 'select', 'select_one', 'setup', 'smooth', 'sourceline', 'sourcepos', 'string', 'strings', 'stripped_strings', 'text', 'unwrap', 'wrap']

But the .find('li') call inside the else branch never succeeds: c2 is always None.

I have also tried this:

c2 = subheading.a

But it is also None.

I've also tried

c2 = subheading.find_all("li")

but then c2 is an empty list.

My end goal is to first check whether the li tag exists, then find the a tag inside it and, if it exists, access the href link and text of the <a> tag.

NOTE: I tried to recreate the same thing in a terminal, and there I got the correct li tag. I stored the subheading markup in a string h and called bs(h, 'html.parser'), on which .find('li') works, but in the script the same call returns None. However, the two objects have different types: the one in the script is <class 'bs4.element.Tag'>, while the one recreated in the terminal is <class 'bs4.BeautifulSoup'>. Does the different object type somehow prevent the lookup, or something similar?

Why do .find('li') and the other calls return None or an empty result even though the tag exists? What am I doing wrong?

CodePudding user response:

I think you may need to clarify your code a bit, because I am getting a different result than you.

pip3 install bs4

Then:

from bs4 import BeautifulSoup
s = """<li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">PRIN 1.1 Application and purpose</a></li>"""
soup = BeautifulSoup(s, "html.parser")  # name the parser explicitly, as in the question
soup.find("li")
# Returns the Correct LI.

If this doesn't solve your issue, the problem may lie in what you are actually searching. Take another look at your string to confirm it is correct.

The BeautifulSoup documentation is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and will likely help you get the data into the correct form for querying.
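
One possible explanation for the different results, sketched under the assumption that subheading in the script is the <li> tag itself (as its printed output suggests): find() and find_all() search only the descendants of the object they are called on, never the object itself. A BeautifulSoup object built from the string has the <li> as a child and so finds it, while the Tag for the <li> contains no nested <li>. The variable names below are illustrative.

```python
from bs4 import BeautifulSoup

html = ('<li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">'
        'PRIN 1.1 Application and purpose</a></li>')

soup = BeautifulSoup(html, "html.parser")  # document object; the <li> is its child
li_tag = soup.find("li")                   # bs4.element.Tag for the <li> itself

# find() looks only below the object it is called on.
found_from_soup = soup.find("li")    # the <li> is a descendant of the document
found_from_tag = li_tag.find("li")   # there is no <li> nested inside this <li>

print(type(soup).__name__, found_from_soup is not None)   # BeautifulSoup True
print(type(li_tag).__name__, found_from_tag is not None)  # Tag False
print(li_tag.name)                                        # li
```

If this is what happens in the script, checking subheading.name == "li" would test the tag itself without any search.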

CodePudding user response:

I found a comical way to bypass the None result I was facing. Since the subheading variable is of type bs4.element.Tag, while a bs4.BeautifulSoup object was giving the correct li tags, I thought of converting subheading to a string and parsing it again with BeautifulSoup so that its type becomes bs4.BeautifulSoup. After that, .find('li') works perfectly.

I changed my code to:

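from bs4 import BeautifulSoup as bs

# Re-parse the tag's markup so the <li> becomes a child of a new document object
subheading_str = str(subheading)
subheading_soup = bs(subheading_str, "html.parser")
if subheading_soup.find("ul"):
    print("Going for next level subheading search")
else:
    # Search the re-parsed object, not the original Tag
    c2 = subheading_soup.find("li")
    print(f"c2 : {c2}")  # Not None this time, gives the correct result
    if c2:
        pass  # next code part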

NOTE - This might not be the right or technically correct way to solve the issue, but it works for me.
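
A possible alternative that avoids the re-parse, as a minimal sketch: it assumes subheading really is the <li> Tag shown in the question's output, and extract_link is a hypothetical helper name, not part of bs4. It checks the tag's own name instead of searching for an <li> below it, then reads the href and text of the first <a> inside.

```python
from bs4 import BeautifulSoup


def extract_link(tag):
    """Return (href, text) of the first <a> inside an <li> tag, else None.

    Hypothetical helper for illustration; not part of bs4.
    """
    # Test the tag itself by name rather than searching its descendants.
    if tag is None or tag.name != "li":
        return None
    anchor = tag.find("a")  # the <a> is a descendant, so find() can reach it
    if anchor is None:
        return None
    return anchor.get("href"), anchor.get_text(strip=True)


html = ('<li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">'
        'PRIN 1.1 Application and purpose</a></li>')
subheading = BeautifulSoup(html, "html.parser").find("li")

result = extract_link(subheading)
print(result)
```

Returning None for anything that is not an <li>, or has no <a>, lets the caller use a plain "if result:" check before unpacking.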
