How to get the proper link from a website using python beautifulsoup?

Time:03-09

When I try to scrape roster links, I get https://gwsports.com/roster.aspx?path=wpolo, but when I open that link in Chrome it changes to https://gwsports.com/sports/mens-water-polo/roster. I want to scrape the link in its final form, like the second one (https://gwsports.com/sports/mens-water-polo/roster).

pip install -U gazpacho

from gazpacho import get, Soup

url = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [link.attrs['href'] for link in links]
print(s)

CodePudding user response:

This is not an issue with scraping; you're getting the exact URL that's on the page. That URL then redirects to the final URL, which is the one you need.
You can use the requests library to follow the redirect and read the final URL:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; ' \
    'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}

url = 'https://gwsports.com/roster.aspx?path=wpolo'

r = requests.get(url, allow_redirects=True, headers=headers)
if r.status_code == 200:
    print(r.url) # URL after redirections
else:
    print('Request failed')

Applied to your code, that looks like this:

from gazpacho import get, Soup
import requests

def get_final_url(url, root):
    # Note this function assumes url is relative and always prepends root
    # You may want to extend it to detect absolute URLs
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
        'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}

    # Follow any redirects; r.url then holds the final address
    r = requests.get(root + url, allow_redirects=True, headers=headers)
    if r.status_code == 200:
        return r.url # URL after redirections
    else:
        raise requests.HTTPError(f'Request failed with status {r.status_code}: {r.url}')

url = 'https://gwsports.com'
root = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
# Match every <a> whose href contains "roster"
links = soup.find('a', {'href': "roster"}, partial=True)
s = [get_final_url(link.attrs['href'], root) for link in links]
print(s)

Output

['https://gwsports.com/sports/baseball/roster', 'https://gwsports.com/sports/mens-basketball/roster', 'https://gwsports.com/sports/mens-golf/roster', 'https://gwsports.com/sports/mens-soccer/roster', 'https://gwsports.com/sports/mens-swimming-and-diving/roster', 'https://gwsports.com/sports/mens-cross-country/roster', 'https://gwsports.com/sports/mens-water-polo/roster', 'https://gwsports.com/sports/womens-basketball/roster', 'https://gwsports.com/sports/womens-gymnastics/roster', 'https://gwsports.com/sports/womens-lacrosse/roster', 'https://gwsports.com/sports/womens-rowing/roster', 'https://gwsports.com/sports/womens-soccer/roster', 'https://gwsports.com/sports/softball/roster', 'https://gwsports.com/sports/womens-swimming-and-diving/roster', 'https://gwsports.com/sports/womens-tennis/roster', 'https://gwsports.com/sports/womens-cross-country/roster', 'https://gwsports.com/sports/womens-volleyball/roster']
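
The comment in get_final_url notes that it assumes every href is relative. One possible way to handle absolute links as well is to build the request URL with urljoin from the standard library instead of string concatenation. This is only a sketch: the helper name to_absolute is made up for illustration and is not part of gazpacho or requests.

```python
from urllib.parse import urljoin

def to_absolute(root, href):
    # Hypothetical helper: urljoin resolves a relative href against root
    # and returns href unchanged when it is already an absolute URL.
    return urljoin(root, href)

root = 'https://gwsports.com'
print(to_absolute(root, '/roster.aspx?path=wpolo'))
print(to_absolute(root, 'https://gwsports.com/sports/mens-water-polo/roster'))
```

The result of to_absolute could then be passed to requests.get in place of the concatenated string, so links that already include the domain are not given a second copy of the root.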