How can I extract href and title from this HTML?

Time:04-19

My bs4.element.ResultSet has this format:

    [<h3 >
    <a href="someLink" title="someTitle">SomeTitle</a>
    </h3>,
    <h3 >
    <a href="OtherLink" title="OtherTitle">OtherTitle</a>
    </h3>]

I want to extract these and save them as a list of tuples, [(title, href), (title2, href2)], but I can't seem to do so.

My closest attempt was:

    link = soup.find('h3',class_='foo1').find('a').get('title')
    print(link)

But that only returns the first of the two or more elements. How can I extract each href and title?

CodePudding user response:

Select your elements more specifically, e.g. with CSS selectors, and iterate over the ResultSet to collect the attributes of each element as a list of tuples:

    [(a.get('title'), a.get('href')) for a in soup.select('h3 a[href][title]')]

Example:

    from bs4 import BeautifulSoup

    html = '''
    <h3 >
        <a href="someLink" title="someTitle">SomeTitle</a>
    </h3>
    <h3 >
        <a href="OtherLink" title="OtherTitle">OtherTitle</a>
    </h3>
    '''
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')

    # One (title, href) tuple per <a> inside an <h3> that has both attributes
    print([(a.get('title'), a.get('href')) for a in soup.select('h3 a[href][title]')])

Output:

    [('someTitle', 'someLink'), ('OtherTitle', 'OtherLink')]
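
For comparison, here is a minimal sketch of the same result without CSS selectors, which also shows why the attempt in the question returned a single value: find() stops at the first match, while find_all() returns every match. The class="foo1" attribute on the h3 elements is an assumption taken from the asker's attempt, since the posted HTML shows a bare `<h3 >`.

```python
from bs4 import BeautifulSoup

# Assumed markup: class="foo1" comes from the asker's find() call,
# not from the HTML as posted.
html = """
<h3 class="foo1">
    <a href="someLink" title="someTitle">SomeTitle</a>
</h3>
<h3 class="foo1">
    <a href="OtherLink" title="OtherTitle">OtherTitle</a>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching <h3>, hence a single title.
first_title = soup.find("h3", class_="foo1").find("a").get("title")

# find_all() returns every matching <h3>; build one tuple per link.
pairs = []
for h3 in soup.find_all("h3", class_="foo1"):
    a = h3.find("a")
    if a is not None:  # skip headings that contain no link
        pairs.append((a.get("title"), a.get("href")))

print(first_title)
print(pairs)
```

The explicit loop is longer than the one-line comprehension, but it leaves room for per-element checks such as the `a is not None` guard.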

CodePudding user response:

Code:

    list(map(lambda link: (link.get("href"), link.get("title")), soup.select('h3.foo1 > a[href][title]')))

Explanation:

    soup.select('h3.foo1 > a[href][title]')

Selects all the a elements that have both an href and a title and that are direct children of an h3 element with the foo1 class.

    map(lambda link: ...

For each of those a elements, replace it with what follows.

    (link.get("href"), link.get("title"))

Makes a tuple containing the link's href and title. Swap the two get calls if you want (title, href) order as in the question.

    list(...)

In Python 3, map returns a lazy iterator, so list collects the tuples into a list.
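
To illustrate how the direct-child selector differs from the looser `h3 a[href][title]` selector in the other answer, here is a small sketch. The markup is hypothetical and built only for this comparison: it adds a link nested inside a span, a link with no title, and an h3 without the foo1 class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, built only to contrast the two selectors.
html = """
<h3 class="foo1"><a href="someLink" title="someTitle">SomeTitle</a></h3>
<h3 class="foo1"><span><a href="nestedLink" title="nestedTitle">Nested</a></span></h3>
<h3 class="foo1"><a href="noTitleLink">No title</a></h3>
<h3><a href="plainLink" title="plainTitle">Plain</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" matches only <a> elements that are immediate children of h3.foo1.
direct = [(a["title"], a["href"])
          for a in soup.select("h3.foo1 > a[href][title]")]

# A space matches <a> elements at any depth inside any <h3>.
descendant = [(a["title"], a["href"])
              for a in soup.select("h3 a[href][title]")]

print(direct)
print(descendant)
```

Because both selectors require the href and title attributes, indexing with `a["title"]` cannot raise a KeyError here; `a.get("title")` would be the safer choice with a selector that does not guarantee the attribute.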
