I am using BeautifulSoup in a project and noticed that it removes leading spaces. For example:
from bs4 import BeautifulSoup
sample = " Test"
soup = BeautifulSoup(sample, features="lxml")
for s in soup(["style", "script", "[document]", "head", "title"]):
    s.extract()
print(soup.getText(strip=False))
This prints "Test", without the leading space.
I tried passing strip=False, but it did not help, and I cannot find any discussion of this behavior anywhere. This is an MWE; the real goal is to take HTML-formatted input and print the plain text.
CodePudding user response:
To keep the leading whitespace, you can use html.parser instead of lxml as your parser:
soup = BeautifulSoup(html_doc, 'html.parser')
See the BeautifulSoup documentation on using different parsers:
But if the document is not perfectly-formed, different parsers will give different results...
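Applied to the sample from the question, the parser swap might look like the sketch below. The helper name html_to_text is hypothetical and used only for illustration, and the list of tags to drop is taken from the question rather than being a recommendation.

```python
from bs4 import BeautifulSoup


def html_to_text(markup):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    # html.parser does not wrap a bare fragment in <html><body><p>,
    # so the leading text node is not trimmed during tree building.
    soup = BeautifulSoup(markup, "html.parser")
    # Drop elements whose contents should not appear in the plain text.
    for tag in soup(["style", "script", "head", "title"]):
        tag.decompose()
    return soup.get_text()


sample = " Test"
print(repr(html_to_text(sample)))  # prints ' Test'
```

Using repr() when printing makes the leading space visible. For input that is not well-formed, the output may still differ from what lxml would produce, so it is worth checking against real documents.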