How to get specific links with BeautifulSoup? - CodePudding

I am trying to crawl HTML source with Python using BeautifulSoup.
I need to get the href of specific <a> tags.

This is my test HTML. I want to get the links of the form <a href="/example/test/link/activityN" target="_blank">, for activity1 through activity10.

<div >
   <div  id="activity">
   .
   .
   </div>
   <div >
      <div >
         <div >
            <div>...</div>
            <a href="/example/test/link/activity1" target="_blank">
               <div >
                  <span> 0x1292311</span>
               </div>
            </a>
         </div>
      </div>
      <div >
         <div >
            <div>...</div>
            <a href="/example/test/link/activity2" target="_blank">
               <div >
                  <span> 0x1292312</span>
               </div>
            </a>
         </div>
      </div>
      .
      .
      .
   </div>
</div>

CodePudding user response：

See the main page of the bs4 documentation, which shows how to extract every link on a page:

for link in soup.find_all('a'):
    print(link.get('href'))
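
Since the question asks only for the activity links, find_all can also be narrowed by the href value. The following is a minimal sketch: the sample markup is a shortened, hypothetical copy of the question's HTML, the activity_links helper name is only illustrative, and the /example/test/link/activity prefix is taken from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question.
SAMPLE_HTML = """
<div>
  <a href="/example/test/link/activity1" target="_blank"><span>0x1292311</span></a>
  <a href="/example/test/link/activity2" target="_blank"><span>0x1292312</span></a>
  <a href="/about">About</a>
  <a name="no-href">anchor without href</a>
</div>
"""

def activity_links(html):
    """Return the hrefs that look like activity links (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex passed as href= is applied with search(),
    # so it is anchored with ^ and $ to match the whole value.
    pattern = re.compile(r"^/example/test/link/activity\d+$")
    return [a["href"] for a in soup.find_all("a", href=pattern)]

print(activity_links(SAMPLE_HTML))
```

Tags without an href attribute are skipped by the filter, so indexing a["href"] is safe here.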

CodePudding user response：

Here is code for the problem. Find all the <a> tags first, then read the value of each tag's href attribute.

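from bs4 import BeautifulSoup

# html is the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    # .get() returns None instead of raising KeyError when the attribute is missing
    if a.get('target') == '_blank':
        print(a.get('href'))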
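
The same check can probably be pushed into find_all itself by passing attribute filters as keyword arguments. This is a minimal sketch on a shortened, hypothetical copy of the question's HTML; the extra <a> tags are added only to show what gets filtered out.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question.
SAMPLE_HTML = """
<div>
  <a href="/example/test/link/activity1" target="_blank"><span>0x1292311</span></a>
  <a href="/example/test/link/activity2" target="_blank"><span>0x1292312</span></a>
  <a href="/same-tab">no target attribute</a>
  <a target="_blank">no href attribute</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# target="_blank" keeps only <a> tags with that attribute value;
# href=True additionally requires that an href attribute is present.
links = [a["href"] for a in soup.find_all("a", target="_blank", href=True)]
print(links)
```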

I hope my answer helps.

CodePudding user response：

To select the <a> tags more specifically, as an alternative to @Mason Ma's answer, you can also use CSS selectors:

soup.select('.activity_content a')

or additionally filter by the target attribute:

soup.select('.activity_content a[target="_blank"]')

Example

This will give you a list of links matching your condition:

from bs4 import BeautifulSoup

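# Note: the class attributes were lost when this post was copied. The
# class="activity_content" on the outer <div> is an assumption inferred
# from the selector used below.
html = '''
<div class="activity_content">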
      <div >
         <div >
            <div>...</div>
            <a href="/example/test/link/activity1" target="_blank">
               <div >
                  <span> 0x1292311</span>
               </div>
            </a>
         </div>
      </div>
      <div >
         <div >
            <div>...</div>
            <a href="/example/test/link/activity2" target="_blank">
               <div >
                  <span> 0x1292312</span>
               </div>
            </a>
         </div>
      </div>
'''
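# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

print([x['href'] for x in soup.select('.activity_content a[target="_blank"]')])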

Output

['/example/test/link/activity1', '/example/test/link/activity2']
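
If the class names are not reliable, a CSS attribute selector on the href prefix may be another option. This is only a sketch: the sample markup is a shortened, hypothetical copy of the question's HTML with one unrelated link added, and the prefix is the one given in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question.
SAMPLE_HTML = """
<div>
  <a href="/example/test/link/activity1" target="_blank"><span>0x1292311</span></a>
  <a href="/example/test/link/activity2" target="_blank"><span>0x1292312</span></a>
  <a href="/example/other/page" target="_blank">Other</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# [href^="..."] matches <a> tags whose href starts with the given prefix.
activity_hrefs = [a["href"] for a in soup.select('a[href^="/example/test/link/activity"]')]
print(activity_hrefs)
```

This does not depend on any class attribute, so it keeps working if the surrounding <div> structure changes.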