This is an example of the HTML (I've tried to make it a lot neater than what it actually looks like):
<P>
random text
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=1">FLAG
</a>
</span>
<hr> **THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=2">FLAG
</a>
</span>
<hr>**THIS IS THE TEXT I NEED**
<br>
<br>
<script type="text/javascript">
<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>
**THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i>
I'm trying to get the text that follows each <hr> tag. However, doing
for i in soup.find_all('hr'):
    print(i.text)
does not work. Instead, I get a blank output.
I've also tried
soup.find('i').previousSibling
but that outputs a blank; I'm not sure if that's because of the <br> <br> before it.
How can I get the **THIS IS THE TEXT I NEED**?
CodePudding user response:
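The text you need isn't in the <hr>; it's in the <p>. An <hr> is a void element with no content of its own, so the text after it belongs to the enclosing <p>. With html.parser you can get it like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, "html.parser")
ps = soup.find_all("p")
print(ps[0].get_text())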
Now considering that this prints:
random text
Anonymous
Nov 30 12:46pm
FLAG
**THIS IS THE TEXT I NEED**
Anonymous
Nov 30 3:40pm
FLAG
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
Anonymous
You'll need to parse out the text you need with something like:
import re
rawText = ps[0].getText()
# Non-greedy match, so two marked spans on one line are not merged into one
matches = re.findall(r'\*\*.*?\*\*', rawText)
for m in matches:
    print(m)
Which prints out:
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
**THIS IS THE TEXT I NEED**
But you'll need to fish out your text some other way, because I doubt it is really surrounded by asterisks. Edit: As a side note, you can use soup.find("p") to get the first <p> directly instead of indexing into the list of all matches, but I don't think that really matters.
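One way to avoid the regex is to read the text node that comes right after each <hr>. The snippet below is only a minimal sketch under assumptions: the html string is a shortened stand-in for the page in the question, and the reply texts are made up for illustration. It also shows why i.text was blank: the <hr> itself has no content.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical stand-in for the markup in the question
html = """<p>
random text
<br>
<i>Anonymous</i>
<hr> first reply
<br>
<i>Anonymous</i>
<hr>second reply
<br>
<i>Anonymous</i>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# <hr> is a void element, so its own text is always empty
print(repr(soup.find("hr").text))

replies = []
for hr in soup.find_all("hr"):
    node = hr.next_sibling  # whatever immediately follows the <hr>
    # Keep it only if it is a text node with something other than whitespace
    if isinstance(node, NavigableString) and node.strip():
        replies.append(node.strip())

print(replies)
```

Note that the last snippet in the question follows a <script> tag rather than an <hr>, so it would need the same next_sibling lookup applied to that tag as well.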
CodePudding user response:
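If your HTML document ends up directly inside a <body> tag (which is what I get by copy-pasting your data), the text you want sits directly inside <body>, like loose text on a webpage. This depends on the parser: it needs one that adds a <body> and closes the <p> at the first <hr> (lxml appears to do this), whereas html.parser adds no <body> at all. You can then get all of the text by calling find_all and setting the text and recursive parameters appropriately.
For this data: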
data = """<P>
random text
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 12:46pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=1">FLAG
</a>
</span>
<hr> **THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i>
<span style="font-size: 10px; margin-left: 10px; color: #994;">Nov 30 3:40pm</span>
<span style="font-size: 10px; margin-left: 20px;">
<a style="color: #888; text-decoration: none;" title="Flag as offensive post"
href="/flag?a=248830&r=2">FLAG
</a>
</span>
<hr>**THIS IS THE TEXT I NEED**
<br>
<br>
<script type="text/javascript">
<script type="text/javascript" src="//cdn.chitika.net/getads.js" async></script>
**THIS IS THE TEXT I NEED**
<br>
<br>
<i>Anonymous</i> """
This code:
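from bs4 import BeautifulSoup
# Assumption: lxml is installed; with html.parser, soup.find('body') would be None
soup = BeautifulSoup(data, "lxml")
out = [x.strip(' \n') for x in soup.find('body').find_all(text=True, recursive=False) if x not in ('\n', ' ')]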
fetches the output below:
['**THIS IS THE TEXT I NEED**',
'**THIS IS THE TEXT I NEED**',
'**THIS IS THE TEXT I NEED**']
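Since the result above hinges on how the parser restructures the markup, here is a sketch of a variant that should not depend on it: walk every text node and drop those whose parent is one of the metadata tags. The SKIP_PARENTS set and the shortened html string are assumptions made for illustration, not something taken from the real page.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question
html = """<p>
random text
<br>
<i>Anonymous</i>
<span>Nov 30 12:46pm</span>
<span><a href="/flag?r=1">FLAG</a></span>
<hr> first reply
<br>
<i>Anonymous</i>
<script type="text/javascript" src="ads.js"></script>
second reply
<br>
<i>Anonymous</i>
</p>"""

# Assumed set of tags whose text is metadata (author, date, flag link, scripts)
SKIP_PARENTS = {"i", "span", "a", "script"}

soup = BeautifulSoup(html, "html.parser")

texts = [
    s.strip()
    for s in soup.find_all(string=True)  # every text node, at any depth
    if s.parent.name not in SKIP_PARENTS and s.strip()
]

print(texts)
```

Unlike the <body>-based version, this also returns the first post ("random text"), because it keeps every text node that is not inside one of the skipped tags; drop the first item if that post is not wanted.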