Get text and links contained inside "li" from a bs4.element.ResultSet in Python

[<div ><ol>
<li><a href="https://www.geeksforgeeks.org/array-rotation/">Program for array rotation</a></li></ol></div>]

From the above <class 'bs4.element.ResultSet'>, I want to extract the text "Program for array rotation" and the link "https://www.geeksforgeeks.org/array-rotation/".

How can I do that using Python?

CodePudding user response：

If there is only a single link you want to extract, you could use:

link = soup.select_one('li a[href]')['href']
text = soup.select_one('li a[href]').text
print(link, text)
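
Note that select_one returns None when nothing matches, so subscripting the result directly can raise a TypeError. A minimal sketch of a guarded version follows; the html string is a stand-in assumed from the question, and get_text(strip=True) is used only to trim surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Sample markup assumed from the question, for illustration only
html = (
    '<div><ol><li>'
    '<a href="https://www.geeksforgeeks.org/array-rotation/">Program for array rotation</a>'
    '</li></ol></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Look the element up once and reuse it for both values
a = soup.select_one("li a[href]")

if a is not None:
    link = a["href"]
    text = a.get_text(strip=True)
    print(link, text)
else:
    print("no matching link found")
```

Looking the tag up once also avoids running the same selector twice.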

For a more general approach, you could select all the <a> elements and then iterate over the ResultSet with a dict comprehension. This keeps one entry per unique href (a later duplicate overwrites an earlier one) and also works when there is only a single item:

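from bs4 import BeautifulSoup

# class="rotation" is assumed here so that the div.rotation selector below matches
html = '''
<div class="rotation"><ol>
<li><a href="https://www.geeksforgeeks.org/array-rotation/">Program for array rotation1</a></li>
<li><a href="https://www.geeksforgeeks.org/array-rotation/">Program for array rotation2</a></li></ol></div>
'''

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Both links share one href, so the dict keeps only the last text
{a['href']:a.text for a in soup.select('div.rotation li a[href]')}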

Out:

{'https://www.geeksforgeeks.org/array-rotation/': 'Program for array rotation2'}

Or use a list comprehension to keep every href/text pair, including duplicates:

[{a['href']:a.text} for a in soup.select('div.rotation li a[href]')]

Out:

[{'https://www.geeksforgeeks.org/array-rotation/': 'Program for array rotation1'},
{'https://www.geeksforgeeks.org/array-rotation/': 'Program for array rotation2'}]
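
Since the question starts from a ResultSet rather than a soup object, the same idea can be applied by looping over its elements, each of which is a Tag that supports select. The following is only a sketch: it assumes the ResultSet came from a call like find_all("div"), and the names result_set and pairs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup assumed from the question, for illustration only
html = """
<div><ol>
<li><a href="https://www.geeksforgeeks.org/array-rotation/">Program for array rotation</a></li>
</ol></div>
"""

# Assumption: the ResultSet in the question was produced by something like find_all("div")
result_set = BeautifulSoup(html, "html.parser").find_all("div")

# Each item in a ResultSet is a Tag, so CSS selectors can be run on it directly
pairs = [
    (a.get_text(strip=True), a["href"])
    for div in result_set
    for a in div.select("li a[href]")
]

print(pairs)
# [('Program for array rotation', 'https://www.geeksforgeeks.org/array-rotation/')]
```

Tuples are used here instead of a dict so that links sharing the same href are all kept.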