I am trying to scrape data from a table on the following website: https://www.eliteprospects.com/league/nhl/stats/2021-2022
This is the code I found, which successfully scrapes the first table (skater stats):
import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
# Load result pages 1 through 9 of the skater table.
for page in range(1, 10):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # Parse only the skater table and drop rows that are completely empty.
    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

# Combine all pages into one DataFrame and save it.
df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
But I am having difficulty scraping the goalie stats from the bottom table. Any idea how to modify the code to get the stats from the bottom table? I tried changing the selector from ".player-stats" to ".goalie-stats", but the code raised an error when I ran it.
Thank you!!
CodePudding user response:
I found a way to get the data, but it isn't perfect: the resulting DataFrame contains a lot of unnamed columns. Still, it gets the data, so I hope it's helpful.
import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
# Load pages 1 and 2 of the goalie table (it has its own page parameter).
for page in range(1, 3):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort-goalie-stats=svp&page-goalie={page}#goalies"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # Remove every '%' character from the table HTML before pandas parses it.
    df = (
        pd.read_html(str(soup.select_one(".goalie-stats")).replace('%', ''))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

# Combine all pages into one DataFrame and save it.
df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
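The unnamed columns mentioned above could probably be removed before saving. Below is a minimal sketch of one way to do it, assuming they show up with pandas' usual "Unnamed: N" placeholder labels. The helper name drop_unnamed_columns and the sample rows are made up for illustration and are not taken from the site.

```python
import pandas as pd


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without empty columns or "Unnamed: N" placeholder columns."""
    # Drop columns that contain no data at all.
    cleaned = df.dropna(axis="columns", how="all")
    # pandas labels header cells that have no text as "Unnamed: 0", "Unnamed: 1", ...
    keep = [col for col in cleaned.columns if not str(col).startswith("Unnamed")]
    return cleaned[keep]


# Made-up sample rows standing in for the scraped goalie table (assumption).
sample = pd.DataFrame(
    {
        "Unnamed: 0": [None, None],
        "Player": ["Goalie A", "Goalie B"],
        "GP": [50, 48],
        "Unnamed: 3": [None, None],
    }
)
cleaned_sample = drop_unnamed_columns(sample)
print(cleaned_sample.columns.tolist())  # ['Player', 'GP']
```

With the scraper above, this would be called as df_final = drop_unnamed_columns(df_final) just before to_csv. Note that the name filter also drops an unnamed column that does hold data, so it is worth checking the printed DataFrame first.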