I'm using Beautiful Soup to extract the visible text from a webpage, so I tried to implement the following solution:
from bs4.element import Comment

def filter_visible_texts(element):
    # Skip strings that live inside non-rendered tags
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments
    if isinstance(element, Comment):
        return False
    return True

def extract_visible_text(soup):
    visible_texts = soup.find_all(text=True)
    print(visible_texts)
    filtered_visible_texts = filter(filter_visible_texts, visible_texts)
    return set(text.strip() for text in filtered_visible_texts)
The problem is that it's critical for me to preserve the order of the text.
The Beautiful Soup documentation doesn't mention any optional parameter for preserving order. Is this not possible?
CodePudding user response:
Your problem is the set structure. According to the documentation, it's an unordered collection, i.e. you can never be sure you'll get the same order again.
To keep the order, you could use a dict with the index as key. To remove duplicates (if needed), you'd need to write a little loop.
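The "little loop" could look something like the sketch below. It is only an illustration: the helper name `strip_and_dedupe` is made up, and it assumes you also want to drop the empty strings left over after stripping (your original set kept a single `''`).

```python
def strip_and_dedupe(texts):
    """Strip each string, drop empties and repeats, keep first-seen order."""
    seen = set()      # used only for fast membership checks
    ordered = []      # the list is what preserves document order
    for raw in texts:
        cleaned = raw.strip()
        # Assumption: whitespace-only strings are not wanted in the result
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            ordered.append(cleaned)
    return ordered


sample = ['\n', 'link 1', '\n', 'text inside div', '\n', 'link 2', 'link 1']
print(strip_and_dedupe(sample))
# ['link 1', 'text inside div', 'link 2']
```

On Python 3.7 and later, dicts keep insertion order, so `list(dict.fromkeys(t.strip() for t in texts))` removes duplicates in one line as well, although it keeps the empty string if one is present.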
Since I don't know what your website looks like, I built a little test HTML file to check whether the nesting level in the tree affects the order. It doesn't: the strings come back in the correct order, from top to bottom as they appear in the HTML file.
<html>
<body>
<div/>
<div >
<div>
<a href="example.com" title="Title of the link">link 1</a>
</div>
<div>
<div>text inside div</div>
</div>
<a href="example.com" title="Some more title">link 2</a>
</div>
</body>
</html>
The script I used to extract the text is basically your script without the filtering:
from bs4 import BeautifulSoup

with open("test_order.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# find_all returns the strings in document order
text = soup.find_all(text=True)
print(text)
# Converting to a set discards that order
print(set(t.strip() for t in text))
And the output:
['\n', '\n', '\n', '\n', '\n', 'link 1', '\n', '\n', '\n', 'text inside div', '\n', '\n', 'link 2', '\n', '\n', '\n']
{'', 'link 1', 'link 2', 'text inside div'}
As you can see, the order in the first output is link 1, text inside div, link 2. After converting to a set, the order changes.
In your case, some text may appear nearer the top of the rendered page because CSS positions it there, even though it is defined at a later point in the HTML itself.