I want to use BeautifulSoup to get the text from an HTML string. While get_text()'s separator argument is nice, I would like to use different separators for different tags (or none at all for others).
As an example, consider the HTML:
<p>This is some paragraph text. With a <a href="example.com">link</a>.</p>
<div>This is another paragraph.</div>
Dummy code:
from bs4 import BeautifulSoup
string = '<p>This is some paragraph text. With a <a href="example.com">link</a>.</p>\n<div>This is another paragraph.</div>'
soup = BeautifulSoup(string, "html.parser")
text = soup.get_text('\n', strip=True)
print(text)
Using get_text('\n') outputs
This is some paragraph text. With a
link
.
This is another paragraph.
But the desired output would be
This is some paragraph text. With a link.
This is another paragraph.
Is there a way to use get_text() with the '\n' string as a separator for most tags and no separator for "inline" tags like <a> or <b>?
Note that the HTML I am parsing is not consistent so I can't use a function that corrects this behavior afterwards.
EDIT:
The reason for using a separator as an argument in get_text() is that the input is not guaranteed to have a newline between the two paragraphs.
If the example HTML was
<p>This is some paragraph text. With a <a href="example.com">link</a>.</p><div>This is another paragraph.</div>
the output still has to have the contents of <p> tags separated somehow.
EDIT 2: Added different tags to the examples.
CodePudding user response:
Try this:
from bs4 import BeautifulSoup
string = '<p>This is some paragraph text. With a <a href="example.com">link</a>.</p>\n<p>This is another paragraph.</p>'
soup = BeautifulSoup(string, "html.parser")
print(soup.getText())
Output:
This is some paragraph text. With a link.
This is another paragraph.
EDIT:
Try this for both strings:
from bs4 import BeautifulSoup
string_1 = '<p>This is some paragraph text. With a <a href="example.com">link</a>.</p>\n<p>This is another paragraph.</p>'
string_2 = '<p>This is some paragraph text. With a <a href="example.com">link</a>.</p><p>This is another paragraph.</p>'
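for string in (string_1, string_2):
    # Extract each <p> on its own so inline tags such as <a> stay on the same line.
    paragraphs = [p.getText().strip() for p in BeautifulSoup(string, "html.parser").find_all("p")]
    print("\n".join(paragraphs))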
In both cases it should produce:
This is some paragraph text. With a link.
This is another paragraph.
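If the input mixes block tags (for example <p> and <div>, as in the edited question), one possible approach is to insert the separator around every block-level tag before calling get_text(), so inline tags such as <a> or <b> contribute no separator. The following is only a sketch: the function name text_with_block_breaks and the BLOCK_TAGS list are hypothetical choices made for illustration, and the list is not an exhaustive set of block elements.

```python
from bs4 import BeautifulSoup

# Assumption: an illustrative, non-exhaustive list of tags treated as "block".
BLOCK_TAGS = ["p", "div", "li", "h1", "h2", "h3"]


def text_with_block_breaks(html, separator="\n"):
    """Return the text of html with separator between block-level tags only."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so modifying the tree inside the loop is safe.
    for tag in soup.find_all(BLOCK_TAGS):
        # Surround each block tag with the separator; inline tags are untouched.
        tag.insert_before(separator)
        tag.insert_after(separator)
    # Split on the separator, collapse whitespace, and drop empty chunks
    # produced by adjacent or nested block tags.
    chunks = (" ".join(chunk.split()) for chunk in soup.get_text().split(separator))
    return separator.join(chunk for chunk in chunks if chunk)


html = ('<p>This is some paragraph text. With a <a href="example.com">link</a>.</p>'
        '<div>This is another paragraph.</div>')
print(text_with_block_breaks(html))
```

Because the separators are added around the tags themselves, the result does not depend on whether the source HTML happens to contain a newline between the two blocks.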
CodePudding user response:
This should get you the text of every element as separate list items, which you can then select, slice, or process further:
from bs4 import BeautifulSoup
html = '<p>This is some paragraph text. With a <a href="example.com">link</a>.</p>\n<p>This is another paragraph.</p>'
soup = BeautifulSoup(html, 'html.parser')
print([x.get_text() for x in soup.find_all() if len(x.text) > 0])
Result in terminal:
['This is some paragraph text. With a link.', 'link', 'This is another paragraph.']
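Note that the nested <a> shows up as its own entry in that list. If only the outermost elements are wanted, a possible variation is to pass recursive=False to find_all(), which restricts the search to direct children. This is a sketch that assumes the block elements sit at the top level of the fragment and that "html.parser" is used (other parsers may wrap the fragment in <html> and <body>); the variable name top_level is just illustrative.

```python
from bs4 import BeautifulSoup

html = ('<p>This is some paragraph text. With a <a href="example.com">link</a>.</p>\n'
        '<div>This is another paragraph.</div>')
soup = BeautifulSoup(html, "html.parser")

# recursive=False looks only at direct children of the parsed fragment,
# so the nested <a> is not returned as a separate element.
top_level = [el.get_text().strip() for el in soup.find_all(recursive=False)]
print(top_level)
# ['This is some paragraph text. With a link.', 'This is another paragraph.']
```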
This is nothing special, nor dramatically different from baduker's response. I suspect your question is in fact an XY problem, so it would help if you explained your end goal and confirmed the URL you are trying to scrape. The BeautifulSoup documentation might also help: https://beautiful-soup-4.readthedocs.io/en/latest/index.html