Home > Back-end >  Looking For Simple Python Scraping Help: Having Trouble Identifying Sections and Class with Beautifu
Looking For Simple Python Scraping Help: Having Trouble Identifying Sections and Class with Beautifu

Time:11-21

I am trying to learn how to scrape data. I am very new to Python, so bare with me. Upon searching YouTube, I found a tutorial and tried to scrape some data off of "https://www.pgatour.com/competition/2022/hero-world-challenge/leaderboard.html"

ENTIRE BOX FINISHING POSITION

from bs4 import BeautifulSoup
import requests

SCRAPE = requests.get("https://www.pgatour.com/competition/2022/hero-world-challenge/leaderboard.html")

print(SCRAPE)

#Response [200] = Succesful...

#http response status codes
    #Information Responses 100-199
    #Successful 200-299
    #Redirects 300-399
    #Client Errors 400-499
    #Server Errors 500-599

soup = BeautifulSoup(SCRAPE.content, 'html.parser')

#tells that the data is html and we need to parse it 

table = soup.find_all('div', class_="leaderboard leaderboard-table large" )

#pick the large section that contains all the info you need
    #then, pick each smaller section, find the type and class.

for list in table:
    name = list.find('div', class_="player-name-col")
    position = list.find('td', class_="position")
    total = list.find('td', class_="total")
    
    print(name, position, total)

Above is my code.. I also included pictures with the inspect open so I can show you what I was thinking when I tried to find the type and class within the leaderboard.

When I print, nothing happens. Any help is appreciated!

CodePudding user response:

Data is loaded dynamically by JavaScript and bs4 can't render JS that's why your code is printing nothing but you can pull the required data from API.

Example:

import pandas as pd
import requests

api_url= 'https://lbdata.pgatour.com/2022/r/478/leaderboard.json?userTrackingId=eyJhbGciOiJIUzI1NiJ9.eyJpYXQiOjE2Njg5OTEzNTcsIm5iZiI6MTY2ODk5MTM1NywiZXhwIjoxNjY4OTkzMTU3fQ.eTvZpdJgVp5yzSQz4J8n8ovzaBnKPmLhZm6gfitKJeU'
headers={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
data=[]

res=requests.get(api_url,headers=headers)
#print(res)
for item in res.json()['rows']:
    startRanks = item['total']
    data.append({'total':startRanks})

df= pd.DataFrame(data)
print(df)

Output:

         total
0         -18
1         -17
2         -15
3         -15
4         -14
5         -14
6         -13
7         -13
8         -11
9         -11
10        -11
11        -10
12        -10
13         -8
14         -8
15         -8
16         -7
17         -6
18          1
19          6
  • Related