Extract the content of an HTML tag, including all the tags inside it, with BeautifulSoup

Time:10-29

I would like to extract the content of a tag without the tag itself, while keeping all the tags inside it.

from bs4 import BeautifulSoup

html_text = "<TD WIDTH=60%>Portsmouth - Cherbourg<BR/>Portsmouth - Santander<BR/></TD>"
soup = BeautifulSoup(html_text, 'html.parser')
soup_list = soup.find_all("td")
soup_object = soup_list[0]
text = soup_object.getText()
print(soup_object)

I get:

<td width="60%">Portsmouth - Cherbourg<br/>Portsmouth - Santander<br/></td>

But I want:

Portsmouth - Cherbourg<br/>Portsmouth - Santander<br/>

Using:

soup_object.getText()

... returns:

Portsmouth - CherbourgPortsmouth - Santander

But it is not what I want.

I know I can get the full content of the tag from a regex:

re.search("<td.*?>(.*)</td>", str(soup_object)).group(1)

... but I use BeautifulSoup so I don't have to type this kind of code.

A last thing:

soup_object.contents

... does not return what I want:

['Portsmouth - Cherbourg', <br/>, 'Portsmouth - Santander', <br/>]

Am I missing out on a BeautifulSoup feature?

CodePudding user response:

You might use decode_contents as follows:

from bs4 import BeautifulSoup
html_text = "<TD WIDTH=60%>Portsmouth - Cherbourg<BR/>Portsmouth - Santander<BR/></TD>"
soup = BeautifulSoup(html_text, 'html.parser')
soup_list = soup.find_all("td")
soup_object = soup_list[0]
inner_html = soup_object.decode_contents()
print(inner_html)

Output:

Portsmouth - Cherbourg<br/>Portsmouth - Santander<br/>
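
For comparison, here is a minimal sketch that puts decode_contents() next to a hand-rolled join over contents and the text-only get_text(), using the HTML from the question. The variable names (td, inner_html, joined, text_only) are just illustrative. For this input the first two results agree; text containing characters such as & may be escaped differently by the two approaches.

```python
from bs4 import BeautifulSoup

html_text = "<TD WIDTH=60%>Portsmouth - Cherbourg<BR/>Portsmouth - Santander<BR/></TD>"
soup = BeautifulSoup(html_text, "html.parser")
td = soup.find("td")

# Serialize only the children of the tag, inner markup included.
inner_html = td.decode_contents()

# Roughly the same thing by hand: stringify each child and concatenate.
joined = "".join(str(child) for child in td.contents)

# Text only: the <br/> tags are dropped.
text_only = td.get_text()

print(inner_html)
print(joined)
print(text_only)
```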

CodePudding user response:

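TIL about the soup.find("td").decode_contents() method, which does seem like the simplest and best approach. Even without it, though, there are several ways to get the "innerHTML".

Since you already noticed contents, you can convert each child to a string and join them (joining contents directly fails because it holds Tag objects as well as strings):

''.join(str(child) for child in soup_object.contents)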

or, since we know prettify produces a certain format,

'\n'.join(soup_object.prettify().split('\n')[1:-1])

although I will admit that this is rather like using regex after all...


By the way, find() is pretty much the same as find_all()[0]. I wasn't sure you knew, but I find plain find() much more convenient in these situations.
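
One difference worth keeping in mind is what happens when nothing matches. The sketch below uses a small made-up table (not from the question) to show that find() returns the same Tag object as find_all()[0] when there is a match, while for a missing tag find() returns None and find_all()[0] raises IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration.
soup = BeautifulSoup("<table><tr><td>a</td><td>b</td></tr></table>", "html.parser")

# With a match, both approaches hand back the first <td>.
first_via_find = soup.find("td")
first_via_find_all = soup.find_all("td")[0]
same = first_via_find is first_via_find_all

# Without a match, find() returns None...
missing = soup.find("th")

# ...while indexing the empty result of find_all() raises IndexError.
try:
    soup.find_all("th")[0]
    raised = False
except IndexError:
    raised = True

print(same, missing, raised)
```

So with find(), a single "is None" check is enough before calling decode_contents().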
