How can I extract the below data through beautifulsoup?
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
I tried .findall
& .get_text
, however I am not able to extract the text that is in "htmlText" values
Output I am expecting is "some thing ORget exact data from here"
CodePudding user response:
Code and example in the online IDE (use the most readable):
from bs4 import BeautifulSoup
import lxml
html = """
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
"""
soup = BeautifulSoup(html, "lxml")
# BeautifulSoup inside BeautifulSoup
undreadable_soup = BeautifulSoup(BeautifulSoup(html, "lxml").select_one('htmlText').text, "lxml").p.text
print(undreadable_soup)
example_1 = BeautifulSoup(soup.select_one('htmlText').text, "lxml").p.text
print(text_1)
# wihtout hardcoded list slices
for result in soup.select("htmlText"):
example_2 = BeautifulSoup(result.text, "lxml").p.text
print(example_2)
# or one liner
example_3 = ''.join([BeautifulSoup(result.text, "lxml").p.text for result in soup.select("htmlText")])
print(example_3)
# output
'''
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
'''
CodePudding user response:
You could use BeautifulSoup twice, first extract the htmlText
element and then parse the contents. For example:
from bs4 import BeautifulSoup
html = """
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
"""
soup = BeautifulSoup(html, "lxml")
for tag1 in soup.find_all("tag1"):
cdata_html = tag1.htmltext.text
cdata_soup = BeautifulSoup(cdata_html, "lxml")
print(cdata_soup.p.text)
Which would display:
some thing ORget exact data from here