How to get the HTML text inside a tag with Beautiful Soup in Python

Time:11-11

How can I extract the data below with BeautifulSoup?

<Tag1>
                    <message code="able to extract text from here"/>
                    <text value="able to extract text that is here"/>
                    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>

I tried .find_all and .get_text, but I am not able to extract the text inside the "htmlText" element.

The output I am expecting is "some thing ORget exact data from here"

CodePudding user response:

Code with a full example (pick whichever variant you find most readable):

from bs4 import BeautifulSoup
import lxml  # not used directly; importing it just confirms the "lxml" parser is installed

html = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""

soup = BeautifulSoup(html, "lxml")


# BeautifulSoup inside BeautifulSoup: works, but hard to read
unreadable_soup = BeautifulSoup(BeautifulSoup(html, "lxml").select_one('htmlText').text, "lxml").p.text
print(unreadable_soup)


# same idea, reusing the soup that was already parsed above
example_1 = BeautifulSoup(soup.select_one('htmlText').text, "lxml").p.text
print(example_1)


# without hardcoding an index: loop over every htmlText element
for result in soup.select("htmlText"):
    example_2 = BeautifulSoup(result.text, "lxml").p.text
    print(example_2)


# or as a one-liner
example_3 = ''.join([BeautifulSoup(result.text, "lxml").p.text for result in soup.select("htmlText")])
print(example_3)


# output
'''
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
'''

CodePudding user response:

You could use BeautifulSoup twice, first extract the htmlText element and then parse the contents. For example:

from bs4 import BeautifulSoup

html = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""
soup = BeautifulSoup(html, "lxml")

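# the lxml HTML parser lowercases tag names, hence "tag1" and .htmltext
for tag1 in soup.find_all("tag1"):
    # first pass: .text returns the entity-decoded string, which is itself markup
    cdata_html = tag1.htmltext.text
    # second pass: parse that string so the <p> element becomes a real tag
    cdata_soup = BeautifulSoup(cdata_html, "lxml")

    print(cdata_soup.p.text)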

Which would display:

some thing ORget exact data from here
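
Both answers need two passes for the same reason: the content of htmlText is entity-escaped (&lt;p&gt; rather than <p>), so the first parse only gives back a string that happens to contain markup. Below is a minimal sketch of that mechanism using only the standard library instead of Beautiful Soup, so it runs without extra packages. It assumes the outer document is well-formed XML, as in the question. The names strip_cdata, _TextCollector and extract_html_text are made up for illustration, and the CDATA wrapper is removed explicitly rather than relying on how a particular parser treats it.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SAMPLE = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""

# Matches a string that is entirely wrapped in <![CDATA[ ... ]]>
CDATA_RE = re.compile(r"^\s*<!\[CDATA\[(.*)\]\]>\s*$", re.DOTALL)


def strip_cdata(markup):
    """Return the payload of a CDATA wrapper, or the input unchanged."""
    match = CDATA_RE.match(markup)
    return match.group(1) if match else markup


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_html_text(xml_source, tag_name="htmlText"):
    """Hypothetical helper: one plain-text string per matching element."""
    # Pass 1: the XML parser decodes &lt; / &gt;, so element.text is markup.
    root = ET.fromstring(xml_source.strip())
    results = []
    for element in root.iter(tag_name):
        inner_markup = strip_cdata(element.text or "")
        # Pass 2: parse the decoded markup and keep only its text nodes.
        collector = _TextCollector()
        collector.feed(inner_markup)
        collector.close()
        results.append("".join(collector.parts))
    return results


if __name__ == "__main__":
    for text in extract_html_text(SAMPLE):
        print(text)
```

Unlike the lxml-based answers, ElementTree keeps the original case of tag names, so the lookup here uses "htmlText" rather than "htmltext".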