How to get text from hr tag using BeautifulSoup?

Time:02-25

This is an example of the HTML (I've tried to make it a lot neater than what it actually looks like):

<P>
random text
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span> 
<span style="font-size: 10px; margin-left: 20px;">
   <a style="color: #888; text-decoration: none;" title="Flag as offensive post"      
       href="/flag?a=248830&r=1">FLAG
   </a>
</span>

<hr> **THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>   
<span style="font-size: 10px; margin-left: 20px;">
    <a style="color: #888; text-decoration: none;" title="Flag as offensive post" 
       href="/flag?a=248830&r=2">FLAG
    </a>
</span>

<hr>**THIS IS THE TEXT I NEED**
<br>
<br>

<script type="text/javascript">

<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>

**THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i> 

I'm trying to get the text from the hr tag. However, doing

for i in soup.find_all('hr'):
    print(i.text)

does not work. Instead, I get a blank output.

I've also tried

soup.find('i').previousSibling

but that also outputs a blank. I'm not sure if that's because of the <br> <br> before it.

How can I get the **THIS IS THE TEXT I NEED**?

CodePudding user response:

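The text you need isn't inside an <hr>. An <hr> is a void element with no content of its own, so the text that follows it is a sibling of the <hr> inside the <p>. You can get all of the <p> text like this: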

from bs4 import BeautifulSoup

soup = BeautifulSoup(doc, "html.parser")  # doc holds the HTML from the question
ps = soup.find_all("p")
print(ps[0].get_text())

Now considering that this prints:

random text


Anonymous
Nov 30 12:46pm

FLAG
   

 **THIS IS THE TEXT I NEED** 


Anonymous
Nov 30 3:40pm

FLAG
    

**THIS IS THE TEXT I NEED**




**THIS IS THE TEXT I NEED** 


Anonymous



You'll need to parse out the text you need with something like:

import re

raw_text = ps[0].get_text()
# Non-greedy match, so two marked spans on one line are not merged into one.
matches = re.findall(r'\*\*.*?\*\*', raw_text)
for m in matches:
    print(m)

Which prints out:

**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**

But you'll need to fish out your text some other way, because I doubt the real text is surrounded by asterisks. Edit: As a side note, you can use soup.find (which returns only the first match) instead of fetching every <p> and indexing the first one, but I don't think that really matters.
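One way to avoid the regex is to walk from each <hr> to the node that follows it. The sketch below assumes the html.parser builder and uses a shortened, made-up stand-in for the question's markup; the helper name text_after_each_hr is hypothetical. It also shows why printing the <hr> itself gives a blank: the tag has no children.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened stand-in for the markup in the question.
html = """<p>
random text
<br>
<i>Anonymous</i>
<hr> first reply
<br>
<i>Anonymous</i>
<hr>second reply
<br>
</p>"""

soup = BeautifulSoup(html, "html.parser")


def text_after_each_hr(soup):
    """Return the stripped text node that directly follows each <hr>."""
    found = []
    for hr in soup.find_all("hr"):
        node = hr.next_sibling
        # Keep only real, non-blank text nodes (skip tags and whitespace).
        if isinstance(node, NavigableString) and node.strip():
            found.append(str(node).strip())
    return found


# <hr> is a void element, so its own text is always empty.
hr_own_text = [hr.get_text() for hr in soup.find_all("hr")]

hr_texts = text_after_each_hr(soup)
print(hr_texts)
```

Note that this only finds text that comes immediately after an <hr>. In the question's markup, the first post ("random text") and the text after the script block are not preceded by an <hr>, so they would need separate handling.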

CodePudding user response:

If your HTML ends up directly inside a <body> tag (which is what I get by copying and pasting your data), the text you want seems to sit directly inside the <body> as loose text nodes. In that case you can get all of them by calling find_all with the text filter and the recursive parameter set appropriately.

For this data:

data = """<P>
random text
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span> 
<span style="font-size: 10px; margin-left: 20px;">
   <a style="color: #888; text-decoration: none;" title="Flag as offensive post"      
       href="/flag?a=248830&r=1">FLAG
   </a>
</span>

<hr> **THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i> 
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>   
<span style="font-size: 10px; margin-left: 20px;">
    <a style="color: #888; text-decoration: none;" title="Flag as offensive post" 
       href="/flag?a=248830&r=2">FLAG
    </a>
</span>

<hr>**THIS IS THE TEXT I NEED**
<br>
<br>

<script type="text/javascript">

<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>

**THIS IS THE TEXT I NEED** 
<br>
<br>
<i>Anonymous</i> """

This code:

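from bs4 import BeautifulSoup
# Needs a parser that adds <body> (e.g. lxml); html.parser does not, so find('body') would be None.
soup = BeautifulSoup(data, "lxml")
out = [x.strip() for x in soup.find('body').find_all(text=True, recursive=False) if x.strip()]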

fetches the output below:

['**THIS IS THE TEXT I NEED**',
 '**THIS IS THE TEXT I NEED**',
 '**THIS IS THE TEXT I NEED**']
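The result above depends on which parser Beautiful Soup uses, because html.parser does not wrap a bare fragment in <body>. As a minimal sketch of the same idea that does not rely on that, the example below uses made-up markup with an explicit <body> and collects only the text nodes that are direct children of it. It uses string=True, the newer name for the text filter (available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with an explicit <body>, so the result
# does not depend on a parser adding one.
html = (
    "<body>first post<br><i>Anonymous</i>"
    "<hr>second post<br><i>Anonymous</i></body>"
)

soup = BeautifulSoup(html, "html.parser")

# string=True matches text nodes; recursive=False limits the search to
# direct children of <body>, so the text inside <i> is skipped.
direct_texts = [
    s.strip()
    for s in soup.body.find_all(string=True, recursive=False)
    if s.strip()
]
print(direct_texts)

# With html.parser, a fragment without <body> gets no <body> added.
fragment_body = BeautifulSoup("<p>x</p>", "html.parser").body
```

If the container is some other tag (a <div> or the <p> itself), the same call works on that tag instead of soup.body.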