I am trying to parse HTML source with Python using BeautifulSoup.
I need to get the href of specific <a> tags.
This is my test HTML. I want to get the links of the form <a href="/example/test/link/activity1~10" target="_blank"> (that is, activity1 through activity10):
<div >
<div id="activity">
.
.
</div>
<div >
<div >
<div >
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div >
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div >
<div >
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div >
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
.
.
.
</div>
</div>
CodePudding user response:
Check the main page of the bs4 documentation:
for link in soup.find_all('a'):
    print(link.get('href'))
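For a self-contained version of that loop, here is a minimal sketch. The sample markup is hypothetical and trimmed from the question's structure, and passing href=True is an extra assumption on my part so that <a> tags without an href are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, trimmed from the structure in the question.
html = """
<div id="activity">
  <a href="/example/test/link/activity1" target="_blank"><span>0x1292311</span></a>
  <a href="/example/test/link/activity2" target="_blank"><span>0x1292312</span></a>
  <a name="no-href">anchor without an href</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [link.get("href") for link in soup.find_all("a", href=True)]
print(hrefs)
```

Without href=True, link.get('href') simply returns None for the last tag, so the plain loop also works; the filter just keeps None out of the result.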
CodePudding user response:
Here is code for the problem. Find all the <a> tags, then read the value of each one's href.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    # .get() returns None instead of raising KeyError when target is missing
    if i.get('target') == "_blank":
        print(i['href'])
I hope this answer helps.
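The same filter can also be handed to find_all directly, which avoids subscripting a tag that may not have a target attribute (tag['target'] raises KeyError in that case). A minimal sketch, with hypothetical sample markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the middle link has no target attribute.
html = """
<a href="/example/test/link/activity1" target="_blank">one</a>
<a href="/somewhere/else">no target attribute</a>
<a href="/example/test/link/activity2" target="_blank">two</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword arguments to find_all act as attribute filters:
# href=True requires the attribute to exist, target="_blank" requires that exact value.
blank_links = [a["href"] for a in soup.find_all("a", href=True, target="_blank")]
print(blank_links)
```

Since href=True guarantees the attribute is present on every match, a["href"] is safe inside the comprehension.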
CodePudding user response:
To select specific <a> tags, as an alternative to @Mason Ma's answer, you can also use CSS selectors:
soup.select('.activity_content a')
or select by the target attribute:
soup.select('.activity_content a[target="_blank"]')
Example
This gives you a list of links matching your condition:
from bs4 import BeautifulSoup
html = '''
<div >
<div >
<div >
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div >
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div >
<div >
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div >
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
'''
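# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
[x['href'] for x in soup.select('.activity_content a[target="_blank"]')]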
Output
['/example/test/link/activity1', '/example/test/link/activity2']
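If the class names are unknown or likely to change, matching on the href prefix may be another option. The sketch below assumes every wanted link starts with /example/test/link/activity, as in the question; the sample markup and the extra "other" link are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two activity links and one unrelated link.
html = """
<div>
  <a href="/example/test/link/activity1" target="_blank">a1</a>
  <a href="/example/test/link/activity2" target="_blank">a2</a>
  <a href="/example/other/page" target="_blank">other</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [href^="..."] is the CSS "attribute starts with" selector.
activity_links = [
    a["href"] for a in soup.select('a[href^="/example/test/link/activity"]')
]
print(activity_links)
```

This relies only on the href values, so it does not depend on any surrounding class or id.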