Hi everyone, I have extracted some HTML elements from a website using BeautifulSoup and find_all. As a result, I have a list of bs4.element.ResultSet objects (a list of lists) like this:
[[<li >neu</li>],
[<li >neu</li>],
[<li >neu</li>, <li >Terrasse</li>],
[<li >neu</li>,
<li >Terrasse</li>,
<li >Parkplatz</li>]]
I would now like to retrieve the text inside the bs4 elements while keeping the same nested list structure. I have been experimenting with two nested loops:
fet = []
for feat in features_bs:
    for fets in feat:
        fet.append(fets.text)
    features.append(fet)
The first loop iterates over every list (feat) within the original list (features_bs). The second loop iterates over every element (fets) in each inner list (feat) and converts the element to text. I then append the text to an empty list (fet), but I would like to keep the same structure as before, with lists inside a list. At the moment I only get one flat list of text like this:
['neu',
'neu',
'neu',
'Terrasse',
'neu',
'Terrasse',
'Parkplatz']
However I would like the output to be:
[['neu'],
['neu'],
['neu', 'Terrasse'],
['neu'],
['Terrasse'],
['Parkplatz']]
Thanks for the help in advance.
CodePudding user response:
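You are close to your goal, but one temporary list is missing. Create a new inner list (el) on every pass of the outer loop, fill it in the inner loop, and append it to the result: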
fet = []
for feat in features_bs:
    el = []  # fresh inner list for each ResultSet
    for fets in feat:
        el.append(fets.text)
    fet.append(el)
fet
Output:
[['neu'], ['neu'], ['neu', 'Terrasse'], ['neu'], ['Terrasse'], ['Parkplatz']]
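The same nested-loop idea can also be written as a nested list comprehension. This is only a minimal sketch: the `SimpleNamespace` objects are hypothetical stand-ins for the bs4 tags (anything with a `.text` attribute works the same way), used so the snippet runs without the original page, and the sample values are made up for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for bs4 tags: only the .text attribute matters here.
features_bs = [
    [SimpleNamespace(text="neu")],
    [SimpleNamespace(text="neu"), SimpleNamespace(text="Terrasse")],
    [SimpleNamespace(text="neu"), SimpleNamespace(text="Terrasse"),
     SimpleNamespace(text="Parkplatz")],
]

# The outer comprehension builds one list per inner group,
# the inner comprehension extracts the text of each element.
features = [[tag.text for tag in group] for group in features_bs]
print(features)
# [['neu'], ['neu', 'Terrasse'], ['neu', 'Terrasse', 'Parkplatz']]
```

With real data, `features_bs` would simply be your existing list of ResultSets and the comprehension line stays unchanged.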
You could also streamline the process and build the expected structure directly while parsing:
from bs4 import BeautifulSoup

html = '''
<ul>
<li >neu</li>
</ul>
<ul>
<li >neu</li>
</ul>
<ul>
<li >neu</li> <li >Terrasse</li>
</ul>
<ul>
<li >neu</li>
</ul>
<ul>
<li >Terrasse</li>
</ul>
<ul>
<li >Parkplatz</li>
</ul>
'''

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

data = []
for ul in soup.find_all('ul'):
    el = []  # one inner list per <ul>
    for e in ul.find_all('li'):
        el.append(e.text)  # append the text, not the tag itself
    data.append(el)
data
Output:
[['neu'], ['neu'], ['neu', 'Terrasse'], ['neu'], ['Terrasse'], ['Parkplatz']]
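If the markup contains extra whitespace around the text, a comprehension with `get_text(strip=True)` might be a compact alternative. This is a sketch under the assumption that every group of features lives in its own `<ul>`; the HTML below is made-up sample data, not the original page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; note the padded text in the last item.
html = """
<ul><li>neu</li></ul>
<ul><li>neu</li><li>Terrasse</li></ul>
<ul><li> Parkplatz </li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# One inner list per <ul>; get_text(strip=True) trims surrounding whitespace.
data = [
    [li.get_text(strip=True) for li in ul.find_all("li")]
    for ul in soup.find_all("ul")
]
print(data)
# [['neu'], ['neu', 'Terrasse'], ['Parkplatz']]
```

Plain `.text` would keep the padding (`' Parkplatz '`), so `get_text(strip=True)` is mainly useful when the source HTML is not cleanly formatted.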