I have content:
<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
</td>
</tr>
<tr>
<td width="80%"> </td>
<td align="right" bgcolor="#FFFF80" style="font-size: 9pt;">
<p><a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
</td>
</tr>
<tr>
<td colspan="2" style="font-size: 9pt;color: red;">
<p>This Recommendation includes an electronic attachment containing the ASN.1 definitions for the IN CS-3 SCF-SRF interface</p>
</td>
</tr>
I want to extract:
- the text of the following link, and
- its href, for the link whose text is Summary:
<a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a>
My code:
import requests
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'html.parser')
CodePudding user response:
You want to pull the URL associated with the link text Summary:
import requests
from bs4 import BeautifulSoup
url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'html.parser')
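# select the first <a> whose text contains "Summary" and read its href
link = soup.select_one('a:-soup-contains("Summary")').get('href')
# note the + operator; without it this line is a SyntaxError
print('https://www.itu.int/rec/T-REC-Q.1238.3-200006-I' + link)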
Output:
https://www.itu.int/rec/T-REC-Q.1238.3-200006-I./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt
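Plain string concatenation leaves the leading "./" of the href fused onto the base, as the output above shows. A minimal sketch of resolving the relative href against the page URL with the standard library's urljoin follows. It assumes the href is meant to be relative to the page it was found on; whether the server actually serves the resulting address has not been checked here.

```python
from urllib.parse import urljoin

# Page the link was scraped from (taken from the question).
page_url = "https://www.itu.int/rec/T-REC-Q.1238.3-200006-I/en"

# href exactly as it appears in the HTML; a raw string keeps the backslashes literal.
href = r"./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt"

# urljoin drops the last path segment of the page URL ("en") and resolves "./" against it.
absolute = urljoin(page_url, href)
print(absolute)
```

The same call can replace the concatenation in the answer: urljoin(url, link).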
CodePudding user response:
If you want the contents and href of every <a> tag, you can loop over the results of find_all as follows:
for a in soup.find_all('a', href=True):
    # a.contents is the list of child nodes; a['href'] is the link target
    print(a.contents, a['href'])
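To get only the Summary link rather than every link, the loop can be wrapped in a function that compares each tag's visible text. This is a rough sketch, not the only way to do it: the helper name find_link_by_text is made up for illustration, and the sample HTML is the two anchors from the question, so no network request is needed.

```python
from bs4 import BeautifulSoup


def find_link_by_text(html, label):
    """Return (text, href) for the first <a> whose visible text equals label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        # get_text(strip=True) flattens nested <strong>/<font> tags and trims whitespace
        text = a.get_text(strip=True)
        if text == label:
            return text, a["href"]
    return None


# Sample markup copied from the question; the raw string keeps backslashes literal.
sample = r'''
<p><a href="/dms_pubrec/itu-t/rec/q/T-REC-Q.1238.3-200006-I!!TOC-TXT-E.txt" target="_blank"><strong><font size="1">Table of Contents </font></strong></a></p>
<p><a href="./htmldoc.asp?doc=t\rec\q\T-REC-Q.1238.3-200006-I!!SUM-TXT-E.txt" target="_blank"><strong><font size="1">Summary </font></strong></a></p>
'''

result = find_link_by_text(sample, "Summary")
print(result)
```

With the live page, pass q.content (or q.text) from the question's code instead of sample.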