Scraping an href

Could someone help me scrape an href attribute and clean it up? I am trying to scrape the URL from the big "Visit Website" button on this page: https://www.goodfirms.co/software/inflow-inventory, and then strip it down to the bare site address.

Code:

import time

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
time.sleep(2)
soup = bs(page.content, 'lxml')
try:
    url = soup.find("div", class_="entity-detail-header-visit-website")
except AttributeError:
    url = "Couldn't Find"
print(url)

Output:

<div >
<a  href="https://www.inflowinventory.com/?utm_source=goodfirms&amp;utm_medium=profile" rel="nofollow" target="_blank">Visit website</a>
</div>

Desired Output:

https://www.inflowinventory.com

CodePudding user response：

This will get you what you need:

import requests
from bs4 import BeautifulSoup

headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}

r = requests.get('https://www.goodfirms.co/software/inflow-inventory', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

link = soup.select_one('a.visit-website-btn')
print(link['href'].split('/?utm')[0])

Result:

https://www.inflowinventory.com

Documentation for BeautifulSoup can be found at:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
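Splitting on '/?utm' works for this particular link, but it relies on the tracking parameters coming first in the query string. As a hedged alternative, here is a minimal sketch that removes only the utm_ parameters using the standard library. The function name strip_utm is made up for illustration, and the URL is copied from the question rather than fetched live.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm(url):
    """Return url without any utm_* query parameters (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query parameter whose name does not start with "utm_".
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


href = "https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile"
cleaned = strip_utm(href)
# Drop the trailing slash to match the desired output in the question.
print(cleaned.rstrip("/"))
```

Unlike the split approach, this keeps any non-tracking parameters and the path intact, so it should also cope with links such as https://example.com/page?id=7&utm_source=x.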

CodePudding user response：

Try this to get the href value:

url = soup.find("a", class_="visit-website-btn").get('href')

Once you have the complete URL, you can extract the host with urlsplit (note that netloc does not include the https:// scheme):

from urllib.parse import urlsplit

print(urlsplit(url).netloc)
#  www.inflowinventory.com
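Since the desired output in the question includes the scheme, here is a small sketch that joins the scheme and netloc fields from urlsplit. The helper name base_url is an assumption for illustration, and the URL is the one shown in the question.

```python
from urllib.parse import urlsplit


def base_url(url):
    """Return scheme://host for url, dropping path, query and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


href = "https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile"
print(base_url(href))
```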

CodePudding user response：

The selector "div", class_="entity-detail-header-visit-website" matches the same URL twice, each time wrapped in HTML. find() returns only the first match, so calling .a.get('href') on its result pulls the right URL.

import requests
from bs4 import BeautifulSoup

url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

link = soup.find("div", class_="entity-detail-header-visit-website").a.get('href')
print(link)

Output:

https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile
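The difference between taking the first match and collecting every match can be tried offline. The sketch below swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages or a network call. The sample markup, including its class attributes, and the class name HrefCollector are assumptions made for illustration, not the real page source.

```python
from html.parser import HTMLParser

# Hypothetical markup: the same link appears twice, as described above.
SAMPLE_HTML = """
<div class="entity-detail-header-visit-website">
  <a class="visit-website-btn" href="https://www.inflowinventory.com/?utm_source=goodfirms&amp;utm_medium=profile">Visit website</a>
</div>
<div class="entity-detail-header-visit-website">
  <a class="visit-website-btn" href="https://www.inflowinventory.com/?utm_source=goodfirms&amp;utm_medium=profile">Visit website</a>
</div>
"""


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


collector = HrefCollector("visit-website-btn")
collector.feed(SAMPLE_HTML)

all_links = collector.hrefs                        # every match, like find_all()
first_link = all_links[0] if all_links else None   # first match only, like find()
print(len(all_links), first_link)
```

Guarding the first-match lookup with a None fallback plays the same role as the try/except AttributeError pattern when the button is missing from the page.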

CodePudding user response：

If you want a solution that stays close to your own code, it looks like this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

try:
    url = soup.find("div", class_="entity-detail-header-visit-website")
    print(url.a.get('href'))

except AttributeError:
    url = "Couldn't Find"
    print(url)

Result:

https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile