Two-part question. First, when I run the code without the final if statement, I'm not getting all of the HREF tags... I see many more links in the Inspector that don't seem to come through.
I'm looking for a fix, but I'm also trying to understand this in general: is there a reason why some links would come through and others would not?
Second, I want to pull only the HREF tags that contain "Surf-Report". I've used this code with p.startswith, and it works... but I couldn't find what the function call would be to say "contains".
I'm new to all of this. I've searched, but I don't fully understand either of these.
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-County-Surfing/278/'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    # collect the href of every <a> tag on the page
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
# print(profiles)
for p in profiles:
    # this is the line in question: it works with p.startswith, but not with "contains"
    if p.contains('Surf-Report'):
        print(p)
For context, my overall goal is to go to these different county pages, and get all of the HREF tags there. Once I have those, I want to visit each individual link and pull the wave sizes from each of the links stored there.
I'm looking to build a way to monitor all waves in New Jersey daily... no purpose, just a fun practice project with something I find interesting.
CodePudding user response:
The URLs on that page appear to be loaded dynamically, via one or more XHR calls. After a brief look at the Network tab in the browser's dev tools, I noticed a call to an API (from which I stripped the query variables). Scraping that API returns over 8k results:
import requests
import pandas as pd

# call the API found in the Network tab and load the JSON response into a DataFrame
r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
df = pd.DataFrame(r.json())
print(df)
Result:
| | _id | _obj | _path | name | description | lat | lon | dataLat | dataLon | surfAreaId | dataSpotId | url | multiplier | optimumSwellAngle | optimumWindAngle | timezone | offset | modelName | isBigWave | ratingType | timeZoneAbbr | hasAdvancedForecast | proteusDataId | proteusResolution | surflineSpotId | defaultModelId | topLevelNav | tidalPort | isDataSpot | favouriteCount | mapImageUrl | breakingWaveModelId | weatherModel | added | hidden | edited | pointOfInterestId | useSDS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Spot | Spot | Newquay - Fistral North | | 50.4184 | -5.0997 | 50.42 | -5.08 | 7 | nan | /Newquay-Fistral-North-Surf-Report/1/ | 0.7 | 290 | 110 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | UK_4m | 584204214e65fad6a7709cec | 42 | True | | True | 0 | https://chart-1.msw.ms/maps/spot/2576f3cfb35dba07a84590141d54d3a5.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | c10396fc-ed41-4771-8e8e-ab8dbff5c67c | True |
| 1 | 2 | Spot | Spot | Porthtowan | | 50.2891 | -5.2461 | 50.27 | -5.3 | 6 | nan | /Porthtowan-Surf-Report/2/ | 0.8 | 290 | 110 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | GLOB_30m | 5842041f4e65fad6a7708c98 | 38 | True | | True | 0 | https://chart-3.msw.ms/maps/spot/d278b42dc4a8adc983a24e2c04333665.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | 39bca112-f093-4a7b-90eb-b7993920e5c4 | True |
| 2 | 3 | Spot | Spot | Gwithian | | 50.2235 | -5.399 | 50.2 | -5.5 | 6 | nan | /Gwithian-Surf-Report/3/ | 0.5 | 285 | 105 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | GLOB_30m | 5842041f4e65fad6a7708c95 | 38 | True | Perranporth | True | 0 | https://chart-5.msw.ms/maps/spot/2a4608d0e793ee20f4566ca85f5ba6cd.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | 6b0785be-1efb-413d-a5a9-ba2133c6ef68 | True |
| 3 | 4 | Spot | Spot | Sennen | | 50.0802 | -5.6976 | 50.07 | -5.7 | 6 | nan | /Sennen-Surf-Report/4/ | 0.8 | 270 | 90 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | GLOB_30m | 5842041f4e65fad6a7708c97 | 38 | True | | True | 0 | https://chart-4.msw.ms/maps/spot/c1be3fe6871d15e4ea5297193b8b81da.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | a641e633-8692-4d4b-b2d6-c4e1d4132c9b | True |
| 4 | 5 | Spot | Spot | Constantine | | 50.5333 | -5.0221 | 50.5759 | -4.92239 | 8 | nan | /Constantine-Surf-Report/5/ | 1 | 270 | 90 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | GLOB_30m | 584204204e65fad6a77090b3 | 38 | True | | True | 0 | https://chart-3.msw.ms/maps/spot/47b00f609d5e46cda66040d8b811bae6.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | 1daacdd5-a92a-4f7c-bc7c-af30a392ef7d | True |
| 5 | 6 | Spot | Spot | Bude - Crooklets | | 50.8358 | -4.5548 | 50.8336 | -4.56057 | 8 | nan | /Bude-Crooklets-Surf-Report/6/ | 1 | 270 | 90 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | GLOB_30m | 5842041f4e65fad6a7708ca5 | 38 | True | | True | 0 | https://chart-1.msw.ms/maps/spot/553d3a850372eee8b10d13d23cbdb78e.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | 6cb522d3-a781-45ae-83cd-fcc941fd47cb | True |
| 6 | 7 | Spot | Spot | Croyde Beach | | 51.1302 | -4.2435 | 51.1449 | -4.25995 | 9 | nan | /Croyde-Beach-Surf-Report/7/ | 0.8 | 270 | 90 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | GLOB_30m | 5842041f4e65fad6a7708ca4 | 38 | True | Ilfracombe, England | True | 0 | https://chart-3.msw.ms/maps/spot/0f967e1e6130e9cb1b2623aafe966b58.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | 2dca4454-5789-4be3-808e-f512fef45dc3 | True |
| 7 | 8 | Spot | Spot | Praa Sands | | 50.103 | -5.391 | 50 | -3.87 | 5 | nan | /Praa-Sands-Surf-Report/8/ | 0.8 | 210 | 30 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | GLOB_30m | 5842041f4e65fad6a7708c9a | 38 | True | | True | 0 | https://chart-4.msw.ms/maps/spot/aea8da3ce8bd22228c07c79db8e9b8de.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | a166dde9-2d1c-4a55-bb34-5a6efce93986 | True |
| 8 | 9 | Spot | Spot | Whitsand Bay | | 50.3387 | -4.2434 | 50.3334 | -4.2433 | 5 | nan | /Whitsand-Bay-Surf-Report/9/ | 0.7 | 225 | 45 | Europe/London | 3600 | glo_30m | False | directional | BST | True | nan | UK_4m | 584204204e65fad6a77090c5 | 42 | True | | True | 0 | https://chart-3.msw.ms/maps/spot/1fe1f342742ba3cf7dd3f8d9943948cc.png | nan | gfs.0p25 | -62169984000 | False | 1617982527 | c6b42c46-7db3-4e53-8f38-b28db957b4e7 | True |
| 9 | 10 | Spot | Spot | Bantham | | 50.2787 | -3.8885 | 50 | -3.87 | 5 | nan | /Bantham-Surf-Report/10/ | 0.8 | 230 | 65 | Europe/London | 3600 | glo_30m | False | directional | BST | True | 2 | UK_4m | 584204204e65fad6a77090c9 | 42 | True | River Yealm | True | 0 | https://chart-1.msw.ms/maps/spot/358c02090c0c31888fee4794b39d397c.png | nan | gfs.0p25 | -62169984000 | False | 1646829186 | d3566d34-b58d-4803-8cf2-3e3dc5fc1a48 | True |
Is this what you're after?
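If only the Surf-Report paths are needed, the JSON could also be filtered directly, without a DataFrame. This is a minimal sketch, assuming the API returns a list of dicts with a url key, as the columns above suggest. filter_spots is a hypothetical helper, and the sample records are copied from the rows shown (plus one made-up record with no url), not fetched live:

```python
def filter_spots(records, needle="Surf-Report"):
    """Return the records whose 'url' value contains `needle`."""
    matches = []
    for record in records:
        url = record.get("url")
        # url may be missing or None, so guard before the substring test
        if url and needle in url:
            matches.append(record)
    return matches


# Two records copied from the table above, plus a hypothetical one without a url
sample = [
    {"_id": 1, "name": "Newquay - Fistral North", "url": "/Newquay-Fistral-North-Surf-Report/1/"},
    {"_id": 2, "name": "Porthtowan", "url": "/Porthtowan-Surf-Report/2/"},
    {"_id": 0, "name": "hypothetical record without a url", "url": None},
]

for spot in filter_spots(sample):
    print(spot["name"], spot["url"])
```

With the live response, the same call would presumably be filter_spots(r.json()), though the exact shape of the response is not confirmed here.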
CodePudding user response:
Since the website uses JavaScript to load its content, you have to render the page before searching for a tags, so you should use requests_html instead of requests.
First, install the requests-html library by running pip install requests-html:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

profiles = []
session = HTMLSession()
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-County-Surfing/278/'
]
for url in urls:
    r = session.get(url)
    r.html.render(sleep=3, timeout=20)  # run the page's JavaScript; sleep 3 s so the content can load
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
for p in profiles:
    # skip <a> tags without an href (None), then test for the substring
    if p and 'Surf-Report' in p:
        print(p)
The result obtained:
/Belmar-Surf-Report/3683/
/Manasquan-Surf-Report/386/
/Ocean-Grove-Surf-Report/7945/
/Asbury-Park-Surf-Report/857/
/Avon-Surf-Report/4050/
/Bay-Head-Surf-Report/4951/
/Belmar-Surf-Report/3683/
/Boardwalk-Surf-Report/9183/
/Bradley-Beach-Surf-Report/7944/
/Casino-Surf-Report/9175/
/Deal-Surf-Report/822/
/Dog-Park-Surf-Report/9174/
/Jenkinsons-Surf-Report/4053/
/Long-Branch-Surf-Report/7946/
/Long-Branch-Surf-Report/7947/
/Manasquan-Surf-Report/386/
/Monmouth-Beach-Surf-Report/4055/
/Ocean-Grove-Surf-Report/7945/
/Point-Pleasant-Surf-Report/7942/
/Sea-Girt-Surf-Report/7943/
/Spring-Lake-Surf-Report/7941/
/The-Cove-Surf-Report/385/
/Belmar-Surf-Report/3683/
/Avon-Surf-Report/4050/
/Deal-Surf-Report/822/
/34th-Street-Surf-Report/9058/
/Lavenia-Ave-Surf-Report/9060/
/Casino-Pier-Surf-Report/387/
/Harvey-Cedars-Surf-Report/7938/
/Hudson-Ave-Surf-Report/9062/
/Mantoloking-Surf-Report/9063/
/Ortley-Beach-Surf-Report/9064/
/Seaside-Park-Surf-Report/9280/
/Ship-Bottom-Surf-Report/9065/
/Spray-Beach-Surf-Report/9066/
/Surf-City-Surf-Report/388/
/34th-Street-Surf-Report/9058/
/Barnegat-Light-Surf-Report/7940/
/Beach-Haven-Surf-Report/7939/
/Casino-Pier-Surf-Report/387/
/Harvey-Cedars-Surf-Report/7938/
/Holyoke-Surf-Report/389/
/Hudson-Ave-Surf-Report/9062/
/Island-Beach-State-Park-Surf-Report/4052/
/Lavenia-Ave-Surf-Report/9060/
/Mantoloking-Surf-Report/9063/
/New-Jersey-Hurricane-Surf-Report/1094/
/Ortley-Beach-Surf-Report/9064/
/Seaside-Park-Surf-Report/9280/
/Ship-Bottom-Surf-Report/9065/
/Spray-Beach-Surf-Report/9066/
/Surf-City-Surf-Report/388/
/New-Jersey-Hurricane-Surf-Report/1094/
/New-Jersey-Hurricane-Surf-Report/1094/
/Casino-Pier-Surf-Report/387/
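Several paths in this output are repeated, presumably because the same spot is linked from more than one place on the page. For the follow-up step of visiting each spot page, it may help to drop the duplicates and turn the relative paths into full URLs. Here is a rough sketch using only the standard library. unique_report_urls and BASE_URL are names made up for this example, and the sample list is a few lines copied from the output above plus a None:

```python
from urllib.parse import urljoin

BASE_URL = "https://magicseaweed.com"  # assumed base, taken from the county URLs


def unique_report_urls(hrefs, needle="Surf-Report", base=BASE_URL):
    """Keep hrefs containing `needle`, drop duplicates, return absolute URLs."""
    seen = set()
    result = []
    for href in hrefs:
        # `needle in href` is Python's substring ("contains") test;
        # the `href and` guard skips None from <a> tags without an href
        if href and needle in href and href not in seen:
            seen.add(href)
            result.append(urljoin(base, href))
    return result


hrefs = [
    "/Belmar-Surf-Report/3683/",
    "/Manasquan-Surf-Report/386/",
    None,
    "/Belmar-Surf-Report/3683/",
]
print(unique_report_urls(hrefs))
```

Python strings have no contains method. The in operator does that job, which is why the loop above uses 'Surf-Report' in p rather than a method call.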