How Do I Go About Scraping All URLs Within a <div> and/or <ul> Class? (Nesting / CSS Selectors)


I am currently doing research on just how good PGA Tour golfers are and what differentiates them from the majority of golfers. PGATour.com has a statistics page that shows every tournament, with updated statistics all the way back to 1980: metrics such as GIR, FIR, SS, UPD, etc.

I'd like all of these stats in a centralized dataset, and I'm about 50% of the way there.

This is the code I've tried thus far.

from bs4 import BeautifulSoup
import requests

# Define the URL to extract from
url = "https://www.pgatour.com/stats/categories.RAPP_INQ.html"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")

# Print the href of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))

You can run it yourself, but it returns a huge amount of unclean data, along with relative URLs that I technically COULD append. The problem is how to tell Python, "Hey, just these URLs: append these to https://www.pgatour.com".
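To illustrate what I mean, here's a rough sketch of the filter-and-append idea (continuing from the soup above); the '/stats' prefix is only my guess at what distinguishes the links I want:

from urllib.parse import urljoin

base = "https://www.pgatour.com"
# keep only relative stat links; the '/stats' prefix is an assumption
stat_links = [urljoin(base, a['href'])
              for a in soup.find_all('a', href=True)
              if a['href'].startswith('/stats')]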

A more efficient way to simplify my code would be to restrict the scrape to just the <div> class that contains the URLs I want, something like the sketch below.
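Here div.stats-links is just a placeholder for whatever the real class name turns out to be:

# 'stats-links' is a placeholder class name; I'd swap in the real one
for a in soup.select('div.stats-links a[href]'):
    print(a['href'])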

Would I go into the page source for that, or simply get all of the URLs from the inspect-element panel?

I'd rather go with the latter for the sake of efficiency, but if you could point me in the right direction on how to learn this, I'd be very grateful.

I've done tons of Google searches and have even watched Keith Galli's videos on web scraping, but maybe I just need sleep after having this project on my mind for days. I just want to get it over with.

Thank you so much!

CodePudding user response:

Based on your comment, you could use CSS selectors as mentioned: search for the element that contains your heading and select the next sibling <div> (the + adjacent-sibling combinator):

soup.select('.header:-soup-contains("Greens in Regulation") + div ul li a[href]')
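Note that :-soup-contains() is a soupsieve pseudo-class, so it needs a reasonably recent BeautifulSoup/soupsieve install. If that's not available, the same idea can be done with find() and find_next_sibling(); this is only a sketch, since it assumes the heading text sits inside an element with class "header":

import re

# same idea without the soupsieve pseudo-class (page structure assumed)
heading = soup.find(string=re.compile('Greens in Regulation'))
if heading:
    section = heading.find_parent(class_='header').find_next_sibling('div')
    for a in section.select('ul li a[href]'):
        print(a.get('href'))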

Example

from bs4 import BeautifulSoup
import requests

# Define the URL to extract from
url = "https://www.pgatour.com/stats/categories.RAPP_INQ.html"
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")

data = [
    {'link': 'https://www.pgatour.com' + a.get('href'),
     'category': a.find_previous('h3').text
     }
    for a in soup.select('.header:-soup-contains("Greens in Regulation") + div ul li a[href]')
]
print(data)

Or, more generically, get all the links from the section, categorized, so you can filter your data in a DataFrame:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.pgatour.com/stats/categories.RAPP_INQ.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

data = [
    {'link': 'https://www.pgatour.com' + a.get('href'),
     'category': a.find_previous('h3').text
     }
    for a in soup.select('.module-statistics-off-the-tee-table ul li a[href]')
]

df = pd.DataFrame(data)
# df.to_excel('myfile.xlsx')  # uncomment to export to Excel
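From there, since you want everything in one centralized dataset, you could iterate over those links and let pandas pull the stat tables. A rough sketch, assuming each stats page carries an HTML table that pandas.read_html can find (that assumption may not hold for every page):

import time

frames = []
for row in data[:5]:  # [:5] just keeps the test quick; drop it once it works
    try:
        tables = pd.read_html(row['link'])  # grabs every <table> on the page
    except ValueError:
        continue  # page had no parseable table
    stat = tables[0]
    stat['category'] = row['category']
    frames.append(stat)
    time.sleep(1)  # be polite to the server

all_stats = pd.concat(frames, ignore_index=True)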