I would like to extract the contents of a tag without the tag itself, while keeping all the tags inside it.
from bs4 import BeautifulSoup
html_text = "<TD WIDTH=60%>Portsmouth - Cherbourg<BR/>Portsmouth - Santander<BR/></TD>"
soup = BeautifulSoup(html_text, 'html.parser')
soup_list = soup.find_all("td")
soup_object = soup_list[0]
text = soup_object.getText()
print(soup_object)
I get:
<td width="60%">Portsmouth - Cherbourg<br/>Portsmouth - Santander<br/></td>
But I want:
Portsmouth - Cherbourg<br/>Portsmouth - Santander<br/>
Using:
soup_object.getText()
... returns:
Portsmouth - CherbourgPortsmouth - Santander
But it is not what I want.
I know I can get the full content of the tag with a regex:
re.search(r"<td.*?>(.*)</td>", str(soup_object)).group(1)
... but I use BeautifulSoup so I don't have to type this kind of code.
A last thing:
soup_object.contents
... does not return what I want:
['Portsmouth - Cherbourg', <br/>, 'Portsmouth - Santander', <br/>]
Am I missing a BeautifulSoup feature?
CodePudding user response:
You can use decode_contents, as follows:
from bs4 import BeautifulSoup
html_text = "<TD WIDTH=60%>Portsmouth - Cherbourg<BR/>Portsmouth - Santander<BR/></TD>"
soup = BeautifulSoup(html_text, 'html.parser')
soup_list = soup.find_all("td")
soup_object = soup_list[0]
inner_html = soup_object.decode_contents()
print(inner_html)
Output:
Portsmouth - Cherbourg<br/>Portsmouth - Santander<br/>
CodePudding user response:
TIL about the soup.find("td").decode_contents() method, and that does seem like the simplest and best approach, but even before it, there were several ways to get the "innerHTML".
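Since you already noticed contents, you could join its items, converting each one to a string first (a plain ''.join(soup_object.contents) raises a TypeError because the list contains Tag objects):
''.join(str(c) for c in soup_object.contents)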
or, since we know prettify produces a certain format,
'\n'.join(soup_object.prettify().split('\n')[1:-1])
although I will admit that this is rather like using regex after all...
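To make the comparison concrete, here is a minimal sketch that runs decode_contents() next to the manual join on the HTML from the question. The variable names via_decode and via_join are only illustrative. The two results match for this input, but I would not assume they always do: as far as I know, str() on a text child does not escape characters such as &, while decode_contents() does.

```python
from bs4 import BeautifulSoup

html_text = "<TD WIDTH=60%>Portsmouth - Cherbourg<BR/>Portsmouth - Santander<BR/></TD>"
soup = BeautifulSoup(html_text, "html.parser")
td = soup.find("td")

# Built-in: serialize only the children, without the enclosing <td>.
via_decode = td.decode_contents()

# Manual equivalent: .contents mixes strings and Tag objects, so each
# child is converted with str() before joining.
via_join = "".join(str(child) for child in td.contents)

print(via_decode)
print(via_decode == via_join)
```

For anything beyond a quick check, decode_contents() is probably the safer choice, since it is the serializer BeautifulSoup itself uses.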
Btw, find() is pretty much the same as find_all()[0]. I wasn't sure you knew, but I find using just find() much more convenient in these situations.
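One place where the two differ is when nothing matches, which is worth keeping in mind. The sketch below uses a made-up two-cell table (not the HTML from the question) to show both cases; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input, chosen only to have more than one <td>.
soup = BeautifulSoup("<table><tr><td>A</td><td>B</td></tr></table>", "html.parser")

# When a match exists, both calls give the first matching tag.
first_via_find = soup.find("td")
first_via_find_all = soup.find_all("td")[0]
same_tag = first_via_find is first_via_find_all

# When nothing matches, find() returns None ...
missing = soup.find("th")

# ... while indexing the empty result of find_all() raises IndexError.
try:
    soup.find_all("th")[0]
    raised_index_error = False
except IndexError:
    raised_index_error = True

print(same_tag, missing, raised_index_error)
```

So with find() you check for None before calling decode_contents(), and with find_all()[0] you guard against an empty list instead.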