I'm trying to parse a URL's content with BeautifulSoup after requests.get() (not shown in the code). The parser being used is "html.parser". I have the following code snippet in a large script.
print(f"subheading : {subheading}")
print(f"type : {type(subheading)}")
print(f"dir : {dir(subheading)}")
if subheading.find('ul'):
    print("Going for next level subheading search")
else:
    c2 = subheading.find("li")
    print(f"c2 : {c2}")
The first print statement gives me this in stdout:
subheading : <li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&timeline=True">PRIN 1.1 Application and purpose</a></li>
I added a type check and an attribute list check, just to confirm whether I'm doing anything wrong. The second and third print statements give me this:
type : <class 'bs4.element.Tag'>
dir : ['DEFAULT_INTERESTING_STRING_TYPES', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_find_all', '_find_one', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_should_pretty_print', 'append', 'attrs', 'can_be_empty_element', 'cdata_list_attributes', 'childGenerator', 'children', 'clear', 'contents', 'decode', 'decode_contents', 'decompose', 'decomposed', 'default', 'descendants', 'encode', 'encode_contents', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 'get_attribute_list', 'get_text', 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 'interesting_string_types', 'isSelfClosing', 'is_empty_element', 'known_xml', 'name', 'namespace', 'next', 'nextGenerator', 'nextSibling', 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 'parent', 'parentGenerator', 'parents', 'parserClass', 'parser_class', 'prefix', 'preserve_whitespace_tags', 'prettify', 'previous', 'previousGenerator', 'previousSibling', 
'previousSiblingGenerator', 'previous_element', 'previous_elements', 'previous_sibling', 'previous_siblings', 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 'replace_with_children', 'select', 'select_one', 'setup', 'smooth', 'sourceline', 'sourcepos', 'string', 'strings', 'stripped_strings', 'text', 'unwrap', 'wrap']
But the .find('li') call inside the else branch never succeeds: c2 is always None.
I have also tried this:
c2 = subheading.a
But it is also None.
I've also tried
c2 = subheading.find_all("li")
but then c2 is an empty list.
My end goal is to first check whether the li tag exists, then find the a tag and, if it exists, access the href link and the text of the <a> tag.
NOTE: I tried to recreate the same thing in a terminal, which gave the correct li tag. I kept subheading in a string h and then called bs(h, 'html.parser'), on which .find('li') works, but when running the script it gives me None. However, the types of these two objects are different. The one in the script is <class 'bs4.element.Tag'> but the one recreated in the terminal is <class 'bs4.BeautifulSoup'>. Does the different object type somehow prevent the attribute access, or something similar?
Why do .find('li') and the other calls give me None or fail even though the tag exists? What am I doing wrong?
CodePudding user response:
I think you may need to clarify your code a bit, because I am getting different results than you.
pip3 install bs4
Then:
from bs4 import BeautifulSoup
s = """<li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&timeline=True">PRIN 1.1 Application and purpose</a></li>"""
soup = BeautifulSoup(s, "html.parser")
soup.find("li")
# Returns the correct <li> tag.
If this doesn't solve your issue, you may have a problem with what you are actually attempting to find. Take another look at your string to confirm it is correct.
Documentation for BeautifulSoup is located at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and will likely help you with ingesting the correct data format for querying.
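A possible explanation for the different results, as a minimal sketch: find() and find_all() search a tag's descendants, not the tag itself. When the string is parsed fresh, the BeautifulSoup object is the document root and the <li> sits underneath it, so find("li") matches. When the object already is the <li> Tag, there is no <li> inside it to find. The markup below is a simplified copy of the one in the question (the query string is dropped), and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Simplified copy of the markup from the question (query string dropped).
html = '<li><a href="/handbook/PRIN/1/1.html">PRIN 1.1 Application and purpose</a></li>'

soup = BeautifulSoup(html, "html.parser")

# The BeautifulSoup object is the document root, so <li> is one of its descendants.
li_from_soup = soup.find("li")

# li_from_soup is a Tag whose own name is "li"; there is no <li> nested inside it,
# so searching it for "li" comes back empty.
nested_li = li_from_soup.find("li")
nested_li_list = li_from_soup.find_all("li")

print(type(soup).__name__, type(li_from_soup).__name__)
print(li_from_soup.name, nested_li, nested_li_list)
```

If this is what happens in the script, checking subheading.name would tell you whether the object you hold is already the <li> you are looking for.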
CodePudding user response:
I found a comical way to bypass the None result I was facing. Since the subheading variable is of type bs4.element.Tag, while a bs4.BeautifulSoup object was giving the correct li tags, I thought of converting subheading to a string and parsing it again with BeautifulSoup so that its type changes to bs4.BeautifulSoup. After that, .find('li') works perfectly.
I changed my code to:
subheading_str = str(subheading)
subheading_soup = bs(subheading_str, "html.parser")
if subheading_soup.find("ul"):
    print("Going for next level subheading search")
else:
    # Search the re-parsed soup, not the original Tag
    c2 = subheading_soup.find("li")
    print(f"c2 : {c2}")  # Not None this time, gives correct result
    if c2:
        pass  # next code part
NOTE: This might not be the right or technically correct way to solve the issue, but it does the job for me.
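For what it's worth, the re-parse is probably not needed. Here is a rough sketch that works directly on the Tag, assuming subheading is either the <li> itself or something that contains one. The helper name link_from_item is made up for this example, and the sample markup is a simplified copy of the one in the question (query string dropped).

```python
from bs4 import BeautifulSoup


def link_from_item(tag):
    """Return (href, text) for the first <a> in an <li>, or None if either is missing.

    Hypothetical helper, not part of bs4.
    """
    # If the tag is not itself an <li>, look for one among its descendants.
    if tag.name != "li":
        tag = tag.find("li")
        if tag is None:
            return None
    # find() searches descendants, so this looks inside the <li>.
    anchor = tag.find("a")
    if anchor is None:
        return None
    return anchor.get("href"), anchor.get_text(strip=True)


html = '<li><a href="/handbook/PRIN/1/1.html">PRIN 1.1 Application and purpose</a></li>'
subheading = BeautifulSoup(html, "html.parser").li  # a bs4.element.Tag named "li"

result = link_from_item(subheading)
print(result)
```

The name check is the part that replaces the str() and re-parse round trip: it handles the case where the object in hand already is the <li>.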