How to scrape url links when the website takes us to a splash screen?

Time:08-08

import re

import requests
from bs4 import BeautifulSoup

R = []
url = "https://ascscotties.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
           'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'html.parser')
# Collect every anchor whose href contains "roster"
links = soup.find_all('a', href=re.compile("roster"))
s = [url + link.get("href") for link in links]
for i in s:
    # Follow redirects and keep only URLs that do not return an error status
    r = requests.get(i, allow_redirects=True, headers=headers)
    if r.status_code < 400:
        R.append(r.url)

Output
['https://ascscotties.com/sports/womens-basketball/roster',
'https://ascscotties.com/sports/womens-cross-country/roster',
'https://ascscotties.com/sports/womens-soccer/roster',
'https://ascscotties.com/sports/softball/roster',
'https://ascscotties.com/sports/womens-tennis/roster',
'https://ascscotties.com/sports/womens-volleyball/roster']

The code looks for roster links on the given URL and collects them, but for a site like "https://auyellowjackets.com/" it fails because the URL takes us to a splash screen. What can be done?

Answer:

The site uses a cookie to record that the splash screen has already been shown. Set that cookie on the request to get straight to the main page:

import re
import requests
from bs4 import BeautifulSoup

R = []
url = "https://auyellowjackets.com"

cookies = {"splash_2": "splash_2"}  # <--- set cookie

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; "
    "Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0"
}
reqs = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(reqs.text, "html.parser")
links = soup.find_all("a", href=re.compile("roster"))
s = [url + link.get("href") for link in links]
for i in s:
    r = requests.get(i, allow_redirects=True, headers=headers)
    if r.status_code < 400:
        R.append(r.url)

print(*R, sep="\n")

Prints:

https://auyellowjackets.com/sports/mens-basketball/roster
https://auyellowjackets.com/sports/mens-cross-country/roster
https://auyellowjackets.com/sports/football/roster
https://auyellowjackets.com/sports/mens-track-and-field/roster
https://auyellowjackets.com/sports/mwrest/roster
https://auyellowjackets.com/sports/womens-basketball/roster
https://auyellowjackets.com/sports/womens-cross-country/roster
https://auyellowjackets.com/sports/womens-soccer/roster
https://auyellowjackets.com/sports/softball/roster
https://auyellowjackets.com/sports/womens-track-and-field/roster
https://auyellowjackets.com/sports/volleyball/roster
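
Both snippets build each absolute URL by joining the base URL and the href as plain strings, which depends on whether the base ends in a slash. As a minimal sketch, assuming the hrefs are root-relative paths such as /sports/softball/roster, the standard library's urllib.parse.urljoin can do the joining instead. The helper name absolute_links is hypothetical and not part of the original answer:

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin handles a base with or without a trailing slash
        full = urljoin(base_url, href)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


# Same result whether or not the base URL ends in "/"
print(absolute_links("https://ascscotties.com/", ["/sports/softball/roster"]))
# ['https://ascscotties.com/sports/softball/roster']
print(absolute_links("https://auyellowjackets.com", ["/sports/football/roster"]))
# ['https://auyellowjackets.com/sports/football/roster']
```

In the scripts above, the list passed as hrefs would be [link.get("href") for link in links].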