I would like to extract the href of the first <a> element
from this <div>:
<div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
target="_blank" title="Opens new window"
href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a
href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
href="/servlet/linkout?type=rightslink&url=startPage=1&pageCount=28&author=Saturnino+M.+Borras+Jr.%2C+%2C+Ian+Scoones%2C+et+al&orderBeanReset=true&imprint=Routledge&volumeNum=49&issueNum=1&contentID=10.1080%2F03066150.2021.1956473&title=Climate+change+and+agrarian+struggles%3A+an+invitation+to+contribute+to+a+JPS+Forum&numPages=28&pa=&oa=CC-BY-NC-ND&issn=0306-6150&publisherName=tandfuk&publication=FJPS&rpt=n&endPage=28&publicationDate=01%2F02%2F2022"
target="_blank" title="Opens new window">Permissions</a>\xa0</div>
The link I want is:
<a href="/doi/abs/10.1080/03066150.2021.1956473">
I'm using BeautifulSoup and I'm also scraping other content from the same page. With the following code,
the result for abstract is None:
for article_entry in article_list_items:
    title_article = article_entry.find('span', class_='hlFld-Title').text
    author = article_entry.find('span', class_='articleEntryAuthorsLinks').text
    abstract = article_entry.find('a', class_='tocDeliverFormatsLinks')
    print(author, title_article, abstract)
Saturnino M. Borras Jr., Ian Scoones, Amita Baviskar, Marc Edelman, Nancy Lee Peluso & Wendy Wolford Climate change and agrarian struggles: an invitation to contribute to a JPS Forum None
Is there a way to get the first href, using something similar to 'a'[:1]?
CodePudding user response:
You can select a list of elements and then index or slice it, or use select_one
with a CSS selector to pick a single element, as follows:
html_doc = '''<div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
target="_blank" title="Opens new window"
href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a
href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
href="/servlet/linkout?type=rightslink&url=startPage=1&pageCount=28&author=Saturnino+M.+Borras+Jr.%2C+%2C+Ian+Scoones%2C+et+al&orderBeanReset=true&imprint=Routledge&volumeNum=49&issueNum=1&contentID=10.1080%2F03066150.2021.1956473&title=Climate+change+and+agrarian+struggles%3A+an+invitation+to+contribute+to+a+JPS+Forum&numPages=28&pa=&oa=CC-BY-NC-ND&issn=0306-6150&publisherName=tandfuk&publication=FJPS&rpt=n&endPage=28&publicationDate=01%2F02%2F2022"
target="_blank" title="Opens new window">Permissions</a>\xa0</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
href = soup.select_one('div.tocDeliverFormatsLinks a').get('href')
print(href)
Output:
/doi/abs/10.1080/03066150.2021.1956473
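For the "select a list, then slice" route, here is a minimal sketch. It assumes the class tocDeliverFormatsLinks sits on the <div> rather than on the <a> tags, which would also explain why find('a', class_='tocDeliverFormatsLinks') returned None in the question. The shortened html_doc and the names container, links, first_href, first_link and same_href are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for one article entry on the page.
html_doc = '''
<div class="tocDeliverFormatsLinks">
<a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> |
<a href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Assumption: the class is on the <div>, so look up the container first.
container = soup.find('div', class_='tocDeliverFormatsLinks')

# find_all returns a list-like ResultSet, so it can be indexed or sliced.
links = container.find_all('a') if container else []
first_href = links[0].get('href') if links else None

# Shortcut: find() returns the first match, or None if nothing matches.
first_link = container.find('a') if container else None
same_href = first_link.get('href') if first_link else None

print(first_href)
```

Inside the loop from the question, the same lookup could be run on article_entry instead of soup. The None checks keep the loop from raising AttributeError on entries that have no such links.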