BeautifulSoup not scraping all HREF links

Time:08-04

This is a two-part question. First, when I run the code without the final if statement, I don't get all of the href links. I see many more links in the Inspector that don't come through.

I'm looking for a fix, but I'd also like to understand the general principle: is there a reason why some links would come through and others would not?

Second, I want to pull only the href links that contain "Surf-Report". I've used this code with p.startswith, and it works, but I couldn't find what the function call would be to say "contains".

I'm new to all of this. I've been searching, but I don't fully understand either issue.

import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-County-Surfing/278/'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):

        profile = profile.get('href')

        profiles.append(profile)

# print(profiles)

for p in profiles:
    if p.contains('Surf-Report'):
        print(p)

For context, my overall goal is to go to these county pages and collect all of the href links there. Once I have those, I want to visit each individual link and pull the wave sizes from the page it points to.

I'm looking to build a way to monitor all waves in New Jersey daily... no purpose, just a fun practice project with something I find interesting.

CodePudding user response:

The URLs on that page appear to be loaded dynamically, via one (or more?) XHR calls. On a brief inspection of the Network tab in the page's dev tools, I noticed a call to an API (from which I stripped the query variables). Requesting that API returns over 8k results:

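import requests
import pandas as pd

# API endpoint seen in the browser's Network tab, with the other query variables stripped
r = requests.get('https://magicseaweed.com/api/mdkey/spot?&limit=-1')
df = pd.DataFrame(r.json())  # build one table from the JSON response
print(df)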

Result:

_id _obj _path name description lat lon dataLat dataLon surfAreaId dataSpotId url multiplier optimumSwellAngle optimumWindAngle timezone offset modelName isBigWave ratingType timeZoneAbbr hasAdvancedForecast proteusDataId proteusResolution surflineSpotId defaultModelId topLevelNav tidalPort isDataSpot favouriteCount mapImageUrl breakingWaveModelId weatherModel added hidden edited pointOfInterestId useSDS
0 1 Spot Spot Newquay - Fistral North 50.4184 -5.0997 50.42 -5.08 7 nan /Newquay-Fistral-North-Surf-Report/1/ 0.7 290 110 Europe/London 3600 glo_30m False directional BST True nan UK_4m 584204214e65fad6a7709cec 42 True True 0 https://chart-1.msw.ms/maps/spot/2576f3cfb35dba07a84590141d54d3a5.png nan gfs.0p25 -62169984000 False 1617982527 c10396fc-ed41-4771-8e8e-ab8dbff5c67c True
1 2 Spot Spot Porthtowan 50.2891 -5.2461 50.27 -5.3 6 nan /Porthtowan-Surf-Report/2/ 0.8 290 110 Europe/London 3600 glo_30m False directional BST True nan GLOB_30m 5842041f4e65fad6a7708c98 38 True True 0 https://chart-3.msw.ms/maps/spot/d278b42dc4a8adc983a24e2c04333665.png nan gfs.0p25 -62169984000 False 1617982527 39bca112-f093-4a7b-90eb-b7993920e5c4 True
2 3 Spot Spot Gwithian 50.2235 -5.399 50.2 -5.5 6 nan /Gwithian-Surf-Report/3/ 0.5 285 105 Europe/London 3600 glo_30m False directional BST True nan GLOB_30m 5842041f4e65fad6a7708c95 38 True Perranporth True 0 https://chart-5.msw.ms/maps/spot/2a4608d0e793ee20f4566ca85f5ba6cd.png nan gfs.0p25 -62169984000 False 1617982527 6b0785be-1efb-413d-a5a9-ba2133c6ef68 True
3 4 Spot Spot Sennen 50.0802 -5.6976 50.07 -5.7 6 nan /Sennen-Surf-Report/4/ 0.8 270 90 Europe/London 3600 glo_30m False directional BST True nan GLOB_30m 5842041f4e65fad6a7708c97 38 True True 0 https://chart-4.msw.ms/maps/spot/c1be3fe6871d15e4ea5297193b8b81da.png nan gfs.0p25 -62169984000 False 1617982527 a641e633-8692-4d4b-b2d6-c4e1d4132c9b True
4 5 Spot Spot Constantine 50.5333 -5.0221 50.5759 -4.92239 8 nan /Constantine-Surf-Report/5/ 1 270 90 Europe/London 3600 glo_30m False directional BST True nan GLOB_30m 584204204e65fad6a77090b3 38 True True 0 https://chart-3.msw.ms/maps/spot/47b00f609d5e46cda66040d8b811bae6.png nan gfs.0p25 -62169984000 False 1617982527 1daacdd5-a92a-4f7c-bc7c-af30a392ef7d True
5 6 Spot Spot Bude - Crooklets 50.8358 -4.5548 50.8336 -4.56057 8 nan /Bude-Crooklets-Surf-Report/6/ 1 270 90 Europe/London 3600 glo_30m False directional BST True nan GLOB_30m 5842041f4e65fad6a7708ca5 38 True True 0 https://chart-1.msw.ms/maps/spot/553d3a850372eee8b10d13d23cbdb78e.png nan gfs.0p25 -62169984000 False 1617982527 6cb522d3-a781-45ae-83cd-fcc941fd47cb True
6 7 Spot Spot Croyde Beach 51.1302 -4.2435 51.1449 -4.25995 9 nan /Croyde-Beach-Surf-Report/7/ 0.8 270 90 Europe/London 3600 glo_30m False directional BST True nan GLOB_30m 5842041f4e65fad6a7708ca4 38 True Ilfracombe, England True 0 https://chart-3.msw.ms/maps/spot/0f967e1e6130e9cb1b2623aafe966b58.png nan gfs.0p25 -62169984000 False 1617982527 2dca4454-5789-4be3-808e-f512fef45dc3 True
7 8 Spot Spot Praa Sands 50.103 -5.391 50 -3.87 5 nan /Praa-Sands-Surf-Report/8/ 0.8 210 30 Europe/London 3600 glo_30m False directional BST True nan GLOB_30m 5842041f4e65fad6a7708c9a 38 True True 0 https://chart-4.msw.ms/maps/spot/aea8da3ce8bd22228c07c79db8e9b8de.png nan gfs.0p25 -62169984000 False 1617982527 a166dde9-2d1c-4a55-bb34-5a6efce93986 True
8 9 Spot Spot Whitsand Bay 50.3387 -4.2434 50.3334 -4.2433 5 nan /Whitsand-Bay-Surf-Report/9/ 0.7 225 45 Europe/London 3600 glo_30m False directional BST True nan UK_4m 584204204e65fad6a77090c5 42 True True 0 https://chart-3.msw.ms/maps/spot/1fe1f342742ba3cf7dd3f8d9943948cc.png nan gfs.0p25 -62169984000 False 1617982527 c6b42c46-7db3-4e53-8f38-b28db957b4e7 True
9 10 Spot Spot Bantham 50.2787 -3.8885 50 -3.87 5 nan /Bantham-Surf-Report/10/ 0.8 230 65 Europe/London 3600 glo_30m False directional BST True 2 UK_4m 584204204e65fad6a77090c9 42 True River Yealm True 0 https://chart-1.msw.ms/maps/spot/358c02090c0c31888fee4794b39d397c.png nan gfs.0p25 -62169984000 False 1646829186 d3566d34-b58d-4803-8cf2-3e3dc5fc1a48 True

Is this what you're after?
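
If it is, a possible next step is to keep only the fields you need from each record. This is a minimal sketch, not tested against the live API: it assumes r.json() yields a list of dicts with the name and url keys shown in the result above. The first two sample records are copied from that output, the third is made up to show the None case, and surf_report_spots is a hypothetical helper name.

```python
def surf_report_spots(records):
    """Return (name, url) pairs for records whose url contains 'Surf-Report'."""
    spots = []
    for rec in records:
        url = rec.get("url")
        # url may be missing or None, so check it before the substring test
        if url and "Surf-Report" in url:
            spots.append((rec.get("name"), url))
    return spots


# Sample records trimmed to two fields; the first two come from the output above
sample = [
    {"name": "Porthtowan", "url": "/Porthtowan-Surf-Report/2/"},
    {"name": "Gwithian", "url": "/Gwithian-Surf-Report/3/"},
    {"name": "No page", "url": None},  # hypothetical record without a url
]
print(surf_report_spots(sample))
```

With the real response you would pass r.json() instead of sample, provided it has the assumed shape.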

CodePudding user response:

Since the website uses JavaScript to load its content, you have to render the page before searching for a tags, so you should use requests_html instead of requests.
First, install the requests-html library by running pip install requests-html

from requests_html import HTMLSession
from bs4 import BeautifulSoup
profiles = []
session = HTMLSession()
urls = [
    'https://magicseaweed.com/New-Jersey-Monmouth-County-Surfing/277/',
    'https://magicseaweed.com/New-Jersey-Ocean-County-Surfing/278/'
]
for url in urls:
    r = session.get(url)
    r.html.render(sleep=3, timeout=20)  # sleep 3s after the initial render so the scripts can finish loading the links
    soup = BeautifulSoup(r.html.raw_html, "html.parser")
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
for p in profiles:
    if p and 'Surf-Report' in p:
        print(p)

The result obtained:

/Belmar-Surf-Report/3683/     
/Manasquan-Surf-Report/386/   
/Ocean-Grove-Surf-Report/7945/
/Asbury-Park-Surf-Report/857/ 
/Avon-Surf-Report/4050/       
/Bay-Head-Surf-Report/4951/   
/Belmar-Surf-Report/3683/
/Boardwalk-Surf-Report/9183/
/Bradley-Beach-Surf-Report/7944/
/Casino-Surf-Report/9175/
/Deal-Surf-Report/822/
/Dog-Park-Surf-Report/9174/
/Jenkinsons-Surf-Report/4053/
/Long-Branch-Surf-Report/7946/
/Long-Branch-Surf-Report/7947/
/Manasquan-Surf-Report/386/
/Monmouth-Beach-Surf-Report/4055/
/Ocean-Grove-Surf-Report/7945/
/Point-Pleasant-Surf-Report/7942/
/Sea-Girt-Surf-Report/7943/
/Spring-Lake-Surf-Report/7941/
/The-Cove-Surf-Report/385/
/Belmar-Surf-Report/3683/
/Avon-Surf-Report/4050/
/Deal-Surf-Report/822/
/34th-Street-Surf-Report/9058/
/Lavenia-Ave-Surf-Report/9060/
/Casino-Pier-Surf-Report/387/
/Harvey-Cedars-Surf-Report/7938/
/Hudson-Ave-Surf-Report/9062/
/Mantoloking-Surf-Report/9063/
/Ortley-Beach-Surf-Report/9064/
/Seaside-Park-Surf-Report/9280/
/Ship-Bottom-Surf-Report/9065/
/Spray-Beach-Surf-Report/9066/
/Surf-City-Surf-Report/388/
/34th-Street-Surf-Report/9058/
/Barnegat-Light-Surf-Report/7940/
/Beach-Haven-Surf-Report/7939/
/Casino-Pier-Surf-Report/387/
/Harvey-Cedars-Surf-Report/7938/
/Holyoke-Surf-Report/389/
/Hudson-Ave-Surf-Report/9062/
/Island-Beach-State-Park-Surf-Report/4052/
/Lavenia-Ave-Surf-Report/9060/
/Mantoloking-Surf-Report/9063/
/New-Jersey-Hurricane-Surf-Report/1094/
/Ortley-Beach-Surf-Report/9064/
/Seaside-Park-Surf-Report/9280/
/Ship-Bottom-Surf-Report/9065/
/Spray-Beach-Surf-Report/9066/
/Surf-City-Surf-Report/388/
/New-Jersey-Hurricane-Surf-Report/1094/
/New-Jersey-Hurricane-Surf-Report/1094/
/Casino-Pier-Surf-Report/387/
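
The output above repeats several links (presumably because the same spot is linked from more than one place on the page), and the paths are relative. Here is a minimal sketch of cleaning up the list before visiting each link. It assumes https://magicseaweed.com is the site root, clean_links is a hypothetical helper name, and the sample input reuses a few values from the output above plus a None and a county link for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://magicseaweed.com"  # assumed site root for the relative paths


def clean_links(hrefs, needle="Surf-Report", base=BASE_URL):
    """Filter hrefs by substring, drop duplicates, and make them absolute."""
    # `needle in href` is Python's "contains" test for strings; the `href and`
    # part skips None, which .get('href') returns for <a> tags without an href
    matching = [href for href in hrefs if href and needle in href]
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    unique = list(dict.fromkeys(matching))
    return [urljoin(base, href) for href in unique]


profiles = [
    "/Belmar-Surf-Report/3683/",
    None,
    "/Manasquan-Surf-Report/386/",
    "/Belmar-Surf-Report/3683/",
    "/New-Jersey-Monmouth-County-Surfing/277/",
]
for link in clean_links(profiles):
    print(link)
```

This also covers the "contains" part of the question: strings have no contains method, and the in operator does that job.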