How to parse a link by its text?

Time:06-08

I have content:

<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
</td>
</tr>
<tr>
<td width="80%">   </td>
<td align="right" bgcolor="#FFFF80" style="font-size: 9pt;">
<p><a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
</td>
</tr>
<tr>
<td colspan="2" style="font-size: 9pt;color: red;">
<p>This Recommendation includes an electronic attachment containing the ASN.1 definitions for the IN CS-3 SCF-SRF interface</p>
</td>
</tr>

I want to extract, for the link below:

  • the link text and
  • the href, selecting the link by its text Summary
<a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a>

My code:

import requests
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"

q = requests.get(url)
result = q.content

soup = BeautifulSoup(result, 'html.parser')

CodePudding user response:

To get the URL associated with the link text Summary, select the <a> element by its text and read its href:

import requests
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"

q = requests.get(url)
result = q.content

soup = BeautifulSoup(result, 'html.parser')

# :-soup-contains() matches the first <a> whose text contains "Summary"
link = soup.select_one('a:-soup-contains("Summary")').get('href')

print('https://www.itu.int/rec/T-REC-Q.1238.3-200006-I' + link)

Output:

https://www.itu.int/rec/T-REC-Q.1238.3-200006-I./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt
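
Plain string concatenation keeps the ./ from the relative href, which is why the output above contains -I./htmldoc.asp. One possible alternative, sketched here with the standard library only, is to let urllib.parse.urljoin resolve the href against the page URL. The helper name absolute_link is made up for illustration, and whether the server actually answers at the resulting address has not been checked here.

```python
from urllib.parse import urljoin

# Page URL and href copied from the question; raw string keeps the backslashes literal.
PAGE_URL = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"
HREF = r"./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt"


def absolute_link(page_url, href):
    """Resolve a relative href against the URL of the page it was found on.

    Hypothetical helper: urljoin applies the standard relative-reference
    rules, so "./" is resolved instead of being pasted into the path.
    """
    return urljoin(page_url, href)


full_url = absolute_link(PAGE_URL, HREF)
print(full_url)
```

The last path segment of the page URL (en) is replaced by htmldoc.asp, and the query string is carried over unchanged, backslashes included.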

CodePudding user response:

If you want the content and href of every <a> tag, you can loop over the results of find_all as follows:

for a in soup.find_all('a', href=True):
    # a.contents is the list of child nodes; a['href'] is the link target
    print(a.contents, a['href'])
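
To keep only the link whose text is Summary, as the question asks, the same loop can be filtered on the link text. The following is a minimal sketch that runs offline on a shortened copy of the HTML from the question; the helper name links_by_text is an assumption for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (raw string keeps the backslashes).
HTML = r'''
<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
<p><a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
'''


def links_by_text(html, wanted):
    """Return (text, href) pairs for every <a> whose stripped text equals `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
        if a.get_text(strip=True) == wanted
    ]


matches = links_by_text(HTML, "Summary")
for text, href in matches:
    print(text, href)
```

The comparison is exact after stripping whitespace, so a link labelled "Summary of changes" would not match; the :-soup-contains selector from the first answer matches substrings instead.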