I am writing a web scraper and am struggling to get the href link from a web page. The URL is https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp. I am trying to get the href link below:
<div >
<a href="https://vcnewsdaily.com/Tessera Therapeutics/venture-funding.php"> >> Click here for more funding data on Tessera Therapeutics</a>
</div>
Here is my code:
from cgi import print_directory
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
import re
URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
for link in soup.findAll(class_='mb-2'):
    links.append(link.get('href'))
print(links)
When I run the code, it outputs:
[None, None, None, None]
Can someone guide me in the right direction?
CodePudding user response:
The variable link holds the element with class mb-2 itself, not the <a> tag inside it, so it has no href attribute and link.get('href') returns None. To select all <a> tags under elements with class mb-2, you can use a CSS selector, for example:
import requests
from bs4 import BeautifulSoup
URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
for link in soup.select(".mb-2 a"):  # <-- select <a> tags here
    links.append(link.get("href"))
print(links)
Prints:
['https://vcnewsdaily.com/Tessera Therapeutics/venture-funding.php', 'https://vcnewsdaily.com/marketing.php']
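The difference between the two lookups can be checked offline. The following is a minimal sketch, not the live page: it assumes the <div> from the question carries class="mb-2" (the class attribute is missing from the snippet as posted), and the names HTML, container_hrefs and anchor_hrefs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup: the snippet from the question with class="mb-2" restored
# on the <div>. The real page may differ.
HTML = """
<div class="mb-2">
  <a href="https://vcnewsdaily.com/Tessera Therapeutics/venture-funding.php">Click here for more funding data on Tessera Therapeutics</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The class filter matches the <div>, which has no href attribute of its own,
# so .get("href") falls back to None.
container_hrefs = [tag.get("href") for tag in soup.find_all(class_="mb-2")]

# The descendant selector ".mb-2 a" matches the <a> inside the <div> instead.
anchor_hrefs = [a.get("href") for a in soup.select(".mb-2 a")]

print(container_hrefs)
print(anchor_hrefs)
```

On this sample, the first list holds only None while the second holds the link, which mirrors the behaviour described in the question.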
CodePudding user response:
Your code almost works. On each matched element, use find instead of get and search for the a tag:
import requests
from bs4 import BeautifulSoup
URL = "https://vcnewsdaily.com/tessera-therapeutics/venture-capital-funding/rsgclpxrcp"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
links = []
for link in soup.findAll(class_='mb-2'):
    links.append(link.find('a'))
print(links)
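Note that find('a') returns the whole <a> tag rather than the URL, and returns None when an element contains no link. The outputs above suggest this happens on the real page: the question's code found four elements with class mb-2, while the selector approach found only two links. Here is a small sketch of pulling just the href strings and skipping the empty matches. The markup is invented for illustration, and container and anchor are hypothetical names.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: one .mb-2 element with a link, one without.
HTML = """
<div class="mb-2"><a href="https://vcnewsdaily.com/marketing.php">Marketing</a></div>
<div class="mb-2">No link here</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

links = []
for container in soup.find_all(class_="mb-2"):
    anchor = container.find("a")  # first <a> inside the element, or None
    if anchor is not None and anchor.has_attr("href"):
        links.append(anchor["href"])  # keep only the URL string

print(links)
```

Only the first element contributes to the list, so the result contains plain strings and no None entries.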