When I scrape the roster links, I get https://gwsports.com/roster.aspx?path=wpolo. When I open that link in Chrome, it changes to https://gwsports.com/sports/mens-water-polo/roster. I want to scrape the links in that final form, like the second one (https://gwsports.com/sports/mens-water-polo/roster).
pip install -U gazpacho
from gazpacho import get, Soup
url = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [link.attrs['href'] for link in links]
print(s)
CodePudding user response:
This is not an issue with scraping: you're getting the exact URL that's on the page. That URL then redirects to the final URL, which is the one you need.
You can use the requests library to get the final URL:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
                         'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = 'https://gwsports.com/roster.aspx?path=wpolo'
r = requests.get(url, allow_redirects=True, headers=headers)
if r.status_code == 200:
    print(r.url)  # URL after redirections
else:
    print('Request failed')
Putting it together, your code becomes:
from gazpacho import get, Soup
import requests
def get_final_url(url, root):
    # Note this function assumes url is relative and always prepends root
    # You may want to extend it to detect absolute URLs
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
                             'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
    r = requests.get(root + url, allow_redirects=True, headers=headers)
    if r.status_code == 200:
        return r.url  # URL after redirections
    else:
        raise requests.HTTPError(f'Request failed with status {r.status_code}')
url = 'https://gwsports.com'
root = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
# Collect every anchor whose href contains "roster"
links = soup.find('a', {'href': "roster"}, partial=True)
# Follow each link's redirects to get its final URL
s = [get_final_url(link.attrs['href'], root) for link in links]
print(s)
Output
['https://gwsports.com/sports/baseball/roster', 'https://gwsports.com/sports/mens-basketball/roster', 'https://gwsports.com/sports/mens-golf/roster', 'https://gwsports.com/sports/mens-soccer/roster', 'https://gwsports.com/sports/mens-swimming-and-diving/roster', 'https://gwsports.com/sports/mens-cross-country/roster', 'https://gwsports.com/sports/mens-water-polo/roster', 'https://gwsports.com/sports/womens-basketball/roster', 'https://gwsports.com/sports/womens-gymnastics/roster', 'https://gwsports.com/sports/womens-lacrosse/roster', 'https://gwsports.com/sports/womens-rowing/roster', 'https://gwsports.com/sports/womens-soccer/roster', 'https://gwsports.com/sports/softball/roster', 'https://gwsports.com/sports/womens-swimming-and-diving/roster', 'https://gwsports.com/sports/womens-tennis/roster', 'https://gwsports.com/sports/womens-cross-country/roster', 'https://gwsports.com/sports/womens-volleyball/roster']
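The comment in get_final_url points out that it assumes every link is relative. As a minimal sketch of one way to handle both relative and absolute links, the standard library's urllib.parse.urljoin can build the request URL. The helper name to_absolute is hypothetical and not part of the original answer:

```python
from urllib.parse import urljoin

def to_absolute(root, href):
    """Return href as an absolute URL, resolving it against root if it is relative."""
    # urljoin keeps an already-absolute href unchanged and resolves
    # a relative one (such as '/roster.aspx?path=wpolo') against root.
    return urljoin(root, href)

root = 'https://gwsports.com'
print(to_absolute(root, '/roster.aspx?path=wpolo'))
print(to_absolute(root, 'https://gwsports.com/sports/baseball/roster'))
```

The result of to_absolute(root, href) can then be passed to requests.get in place of a manually concatenated URL.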