I was wondering if someone could help me scrape an href attribute and clean it up. I am trying to scrape the URL from the big "Visit Website" button on this page: https://www.goodfirms.co/software/inflow-inventory, and then strip it down to the base address.
Code:
import time
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
time.sleep(2)
soup = bs(page.content, 'lxml')
try:
    url = soup.find("div", class_="entity-detail-header-visit-website")
except AttributeError:
    url = "Couldn't Find"
print(url)
Printed output:
<div >
<a href="https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile" rel="nofollow" target="_blank">Visit website</a>
</div>
Desired Output:
https://www.inflowinventory.com
CodePudding user response:
This will get you what you need:
import requests
from bs4 import BeautifulSoup
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
r = requests.get('https://www.goodfirms.co/software/inflow-inventory', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
link = soup.select_one('a.visit-website-btn')
print(link['href'].split('/?utm')[0])
Result:
https://www.inflowinventory.com
Documentation for BeautifulSoup can be found at:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
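The split('/?utm') call in the snippet above depends on the exact shape of this one link. As a rough sketch of a more general cleanup, the standard library's urllib.parse can drop only the utm_ parameters and leave any other query parameters alone. The helper name drop_utm_params and the example.com URL are made up for illustration and are not part of the original answer.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_utm_params(href):
    """Return href with every query parameter starting with 'utm_' removed."""
    parts = urlsplit(href)
    # Keep only the query parameters that are not utm_* tracking parameters.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_")
    ]
    # Rebuild the URL with the filtered query string.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Hypothetical URL that mixes a real parameter with a tracking parameter.
print(drop_utm_params("https://example.com/page?id=7&utm_source=x"))
```

Applied to the href from the question, this leaves the path in place, so the result still ends in a trailing slash; call .rstrip('/') on it if that slash is unwanted.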
CodePudding user response:
Try this code to get the href value:
url = soup.find("a", class_="visit-website-btn").get('href')
Once you have the complete URL, you can get the base with
from urllib.parse import urlsplit
print(urlsplit(url).netloc)
# www.inflowinventory.com
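Note that netloc returns only the host name, without the https:// shown in the desired output. A minimal sketch that puts the scheme back, assuming the link is an absolute URL; base_url is a hypothetical helper name, not something from the answer:

```python
from urllib.parse import urlsplit


def base_url(href):
    """Return scheme and host of an absolute URL, e.g. 'https://host'."""
    parts = urlsplit(href)
    # Join scheme and network location; path, query and fragment are dropped.
    return f"{parts.scheme}://{parts.netloc}"


href = "https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile"
print(base_url(href))
# https://www.inflowinventory.com
```

For a relative link, scheme and netloc would both be empty, so the href should be resolved against the page URL first (for example with urllib.parse.urljoin).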
CodePudding user response:
"div", class_="entity-detail-header-visit-website"
matches the same URL twice, each time wrapped in its HTML content, so calling .a.get('href')
on the result of find() pulls out the right URL:
import requests
from bs4 import BeautifulSoup
url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
link = soup.find("div", class_="entity-detail-header-visit-website").a.get('href')
print(link)
Output:
https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile
CodePudding user response:
If you want a solution that stays close to your own code, it looks like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
try:
    url = soup.find("div", class_="entity-detail-header-visit-website")
    print(url.a.get('href'))
except AttributeError:
    url = "Couldn't Find"
    print(url)
Result:
https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile