HTML parsing not working as expected using BeautifulSoup

Time:02-25

I'm using Python 3 and the BeautifulSoup module, version 4.9.3. I'm trying to use this package to practice parsing some simple HTML.

The string I have is the following:

text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''

I use BeautifulSoup as follows:

import re  # needed for the find_all() calls further down
from bs4 import BeautifulSoup

x = BeautifulSoup(text, "html.parser")

I then experiment with Beautiful Soup's functionality with the following script:

for li in x.find_all('li'):
    print(li)
    print(li.string)
    print(li.next_element)
    print(li.next_element.string)
    print("\n")

The results (at least for the first iteration) are unexpected:

<li><p>Some text</p>is put here</li>
None
<p>Some text</p>
Some text


<li><p>And other text is put here</p></li>
And other text is put here
<p>And other text is put here</p>
And other text is put here

Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?

Similarly, if I do:

x.find_all('li', string=re.compile('text'))

I only get one result (the 2nd tag).

But if I do:

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))

I get 2 results (both tags).

Answer:

Paraphrasing the Beautiful Soup documentation for .string:

  1. If a tag has only one child, and that child is a NavigableString, the child is made available as .string.
  2. If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
  3. If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.

Let's apply these rules to your question:

Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?

The inner p tag satisfies rule #1; it has exactly one child, and that child is a NavigableString, so .string returns that child.

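The first li falls under rule #3; it has two children (the p tag and the NavigableString "is put here"), so .string would be ambiguous and is None. The second li falls under rule #2; its only child is a p tag that has a .string, so the li reports that same string.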
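As a quick check, the three rules can be seen directly on the markup from the question. This is only a minimal sketch: the variable names (html, soup, first_li, second_li) are made up for illustration, and it assumes the same "html.parser" backend used above.

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
first_li, second_li = soup.find_all("li")

# Rule 1: the <p> has exactly one child, a NavigableString.
print(first_li.p.string)        # Some text

# Rule 2: the second <li> has one child, a <p> that itself has a .string.
print(second_li.string)         # And other text is put here

# Rule 3: the first <li> has two children, so .string is None.
print(len(first_li.contents))   # 2
print(first_li.string)          # None
```

Looking at .contents is a handy way to see how many direct children a tag has when .string unexpectedly comes back as None.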


For your second question, consider what the documentation says about the string= argument to .find_all():

With string you can search for strings instead of tags. ... Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string.

Your first example:

x.find_all('li', string=re.compile('text'))
# [<li><p>And other text is put here</p></li>]

That searches for all of the li tags whose .string matches the regular expression. But we have already seen that the first li's .string is None, so it doesn't match.
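To make that concrete, the filter can be written out by hand. The sketch below is an approximation of what the string= argument does when combined with a tag name, not Beautiful Soup's actual implementation; the names by_hand and via_string_arg are hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("text")

# Hand-written approximation: keep only the <li> tags whose own .string
# is not None and matches the regular expression.
by_hand = [
    li for li in soup.find_all("li")
    if li.string is not None and pattern.search(li.string)
]

# The call from the question.
via_string_arg = soup.find_all("li", string=pattern)

print(len(via_string_arg))        # 1
print(by_hand == via_string_arg)
```

The first li never reaches the regular expression test at all, because its .string is None.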

Your second example:

for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))
# ['Some text']
# ['And other text is put here']

This searches for all of the strings contained anywhere in each li subtree, and returns the matching strings rather than tags. For the first li, li.p.string exists and matches, even though li.string is None.
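If the goal is to get back the li tags themselves whenever any text inside them matches, one possible approach (not taken from the documentation passage quoted above) is to pass a function to find_all() and test each tag's combined text with get_text(). The helper name li_with_matching_text is hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("text")

def li_with_matching_text(tag):
    """Return True for <li> tags whose combined text matches the pattern."""
    return tag.name == "li" and pattern.search(tag.get_text()) is not None

# find_all() accepts a function that takes a tag and returns True or False.
matches = soup.find_all(li_with_matching_text)

print(len(matches))  # 2
for li in matches:
    # Join the text fragments with a space so they do not run together.
    print(li.get_text(" ", strip=True))
```

Unlike .string, get_text() concatenates every string in the subtree, so it is never None for a tag that contains text.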
