I'm using Python 3 and the BeautifulSoup module, version 4.9.3. I'm trying to use this package to practise parsing some simple HTML.
The string I have is the following:
text = '''<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>'''
I use BeautifulSoup as follows:
from bs4 import BeautifulSoup
x = BeautifulSoup(text, "html.parser")
I then experiment with Beautiful Soup's functionality with the following script:
for li in x.find_all('li'):
    print(li)
    print(li.string)
    print(li.next_element)
    print(li.next_element.string)
    print("\n")
The results (at least for the first iteration) are unexpected:
<li><p>Some text</p>is put here</li>
None
<p>Some text</p>
Some text
<li><p>And other text is put here</p></li>
And other text is put here
<p>And other text is put here</p>
And other text is put here
Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?
Similarly, if I do:
import re
x.find_all('li', string=re.compile('text'))
I only get one result (the 2nd tag).
But if I do:
for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))
I get 2 results (both tags).
CodePudding user response:
Paraphrasing the doc:
- If a tag has only one child, and that child is a NavigableString, the child is made available as .string.
- If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
- If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
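As a rough check, the three rules can be observed directly on the HTML from the question. This is only a sketch: the variable names (soup, first_li, second_li and so on) are chosen here for illustration and do not come from the original post.

```python
from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
first_li, second_li = soup.find_all("li")

# Rule 1: the <p> has exactly one child, a NavigableString.
p_string = first_li.p.string

# Rule 2: the second <li> has exactly one child, a <p> that has a .string.
second_string = second_li.string

# Rule 3: the first <li> has two children (a <p> tag and a bare string).
first_string = first_li.string

print(repr(p_string))       # 'Some text'
print(repr(second_string))  # 'And other text is put here'
print(first_string)         # None
```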
Let's apply these rules to your question:
Why is the string attribute of the first li tag None, whereas the string attribute of the inner p tag is not None?
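The inner p tag satisfies rule #1: it has exactly one child, and that child is a NavigableString, so .string returns that child.
The first li satisfies rule #3: it has two children (the p tag and the bare string "is put here"), so .string would be ambiguous and is defined to be None.
The second li satisfies rule #2: its only child is a p tag that has a .string, so it takes the same .string as that p.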
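If what you want is all of the text inside a tag regardless of how many children it has, .strings and get_text() may be closer to what you expect than .string. A minimal sketch, assuming only the first li from the question (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li>"
soup = BeautifulSoup(html, "html.parser")
li = soup.li

children = list(li.children)    # direct children: one <p> tag and one bare string
all_strings = list(li.strings)  # every string anywhere below the <li>
joined = li.get_text(" ")       # the same strings joined with a space

print(len(children))  # 2
print(all_strings)    # ['Some text', 'is put here']
print(joined)         # Some text is put here
```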
Considering your second question, let's consult the doc for the string= argument to .find_all():
With string you can search for strings instead of tags. ... Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string.
Your first example:
x.find_all('li', string=re.compile('text'))
# [<li><p>And other text is put here</p></li>]
That searches for all of the li tags whose .string matches the regular expression. But we have already seen that the first li's .string is None, so it doesn't match.
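One way to picture this is to write the same filter by hand. The list comprehension below is only an approximation of what the library does internally, written out under the assumption that the regular expression is applied with search(); the names via_find_all and by_hand are made up for this sketch.

```python
import re

from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("text")

# Tags named "li" whose own .string matches the pattern.
via_find_all = soup.find_all("li", string=pattern)

# Hand-written approximation: skip tags whose .string is None.
by_hand = [
    li
    for li in soup.find_all("li")
    if li.string is not None and pattern.search(li.string)
]

print(via_find_all)  # [<li><p>And other text is put here</p></li>]
print(via_find_all == by_hand)
```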
Your second example:
for li in x.find_all('li'):
    print(li.find_all(string=re.compile('text')))
# ['Some text']
# ['And other text is put here']
This searches for all of the strings contained anywhere in each of the li trees. For the first tree, li.p.string exists and matches, even if li.string doesn't.
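If the goal is to get the li tags themselves whenever any string inside them matches, one possible approach is to test each li with find(string=...). This is a sketch rather than the only way to do it; matching_lis and hits are names invented here.

```python
import re

from bs4 import BeautifulSoup

html = "<li><p>Some text</p>is put here</li><li><p>And other text is put here</p></li>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("text")

# Keep every <li> that contains at least one matching string at any depth.
matching_lis = [
    li for li in soup.find_all("li") if li.find(string=pattern) is not None
]

# Searching by string alone returns the strings themselves, not tags;
# each one still knows which tag it sits in via .parent.
hits = soup.find_all(string=pattern)
parents = [hit.parent.name for hit in hits]

print(len(matching_lis))  # 2
print(hits)               # ['Some text', 'And other text is put here']
print(parents)            # ['p', 'p']
```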