Extracting a string containing a `<br>` using BeautifulSoup returns `None`

I'm using BeautifulSoup, and I need to get the xxx string from the following line:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p > xxx <br/> yyy <br/></p>',
                      'html.parser')

Usually, I would do the following:

one_a_tag = soup.p
t = one_a_tag.string
t

But that doesn't work: it returns None. However, if I delete <br/> yyy <br/>, the code starts working. How do I extract xxx from the original markup?

CodePudding user response：

Try using .strings, which yields each text node inside the tag:

x_and_y = list(soup.p.strings)
print(x_and_y)

Output: [' xxx ', ' yyy ']

.strings is a generator, so the list() call is needed to see all the values at once, but you can also iterate over it with a for loop.
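As a rough sketch of the for-loop variant, using the markup from the question: the loop strips each text node by hand, and .stripped_strings (a related generator in Beautiful Soup) does the same stripping for you. The names parts and first are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p > xxx <br/> yyy <br/></p>', 'html.parser')

# Iterate over the generator directly instead of building a list first
for s in soup.p.strings:
    print(s.strip())

# .stripped_strings yields the same text nodes with surrounding whitespace removed
parts = list(soup.p.stripped_strings)
print(parts)  # ['xxx', 'yyy']

# The first element is the text before the first <br/>
first = parts[0]
print(first)  # xxx
```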

CodePudding user response：

I'm getting the output as follows:

from bs4 import BeautifulSoup

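# The p.object-attr-value selector only matches if <p> carries that class
soup = BeautifulSoup('<p class="object-attr-value"> xxx <br/> yyy <br/></p>', 'html.parser')

# .text joins all the text inside the tag; split() breaks it on whitespace
text = soup.select_one('p.object-attr-value').text
print(text.split()[0])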

Output:

xxx
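One caveat: split()[0] returns only the first whitespace-separated word, so it would cut the text short if the part before the first <br/> contained a space. A possible workaround, sketched here with hypothetical markup that is not from the question, is to take the first item from .stripped_strings instead:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the text before the first <br/> contains a space
soup = BeautifulSoup('<p> first part <br/> second part <br/></p>', 'html.parser')

# Splitting the combined text on whitespace keeps only the first word
first_word = soup.p.text.split()[0]
print(first_word)  # first

# Taking the first text node keeps the whole string; None if the tag has no text
first_node = next(soup.p.stripped_strings, None)
print(first_node)  # first part
```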

CodePudding user response：

You get None from .string because, as the documentation puts it:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
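A small sketch of that rule, comparing a <p> with a single child against the markup from the question (the variable names single and multiple are just for illustration):

```python
from bs4 import BeautifulSoup

# One child (a single text node): .string returns that text
single = BeautifulSoup('<p> xxx </p>', 'html.parser')
print(single.p.string.strip())  # xxx

# Four children (text, <br/>, text, <br/>): .string is None
multiple = BeautifulSoup('<p> xxx <br/> yyy <br/></p>', 'html.parser')
print(len(multiple.p.contents))  # 4
print(multiple.p.string)  # None
```

This also matches what the question observed: deleting <br/> yyy <br/> leaves a single text node, so .string works again.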

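So, to get the text xxx in your example, you can call .find() with text=True, which returns the first text node inside the tag (Beautiful Soup 4.4.0 and later also accept this argument under the name string):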

from bs4 import BeautifulSoup


soup = BeautifulSoup(
    '<p > xxx <br/> yyy <br/></p>', 'html.parser'
)

one_a_tag = soup.p
print(one_a_tag.find(text=True).strip())