I want to scrape players' stats from the NFL website.
The data is spread across several pages, and as long as there is a "Next Page" link I want to scrape each page and combine everything into one dataframe.
Here is my code:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
url='https://www.nfl.com/stats/player-stats/category/passing/2020/reg/all/passingyards/desc'
soup=bs(requests.get(url).content,'html.parser')
data=pd.DataFrame()
while soup.select('a[title="Next Page"]')[0]['href']:
    next_url='https://www.nfl.com' + soup.select('a[title="Next Page"]')[0]['href']
    df=pd.read_html(next_url)[0]
    soup=bs(requests.get(next_url).content,'html.parser')
    data=pd.concat([data,df])
    break
This scrapes data from the first page only. If I remove the break from the end of the code, it collects the data from all pages but then raises an error saying the next page does not exist.
How do I do this properly? Do I need a while loop in this situation, or is there another option?
CodePudding user response:
Your issue is that when soup.select('a[title="Next Page"]') returns an empty list because there are no more pages, the indexing in your while condition, soup.select('a[title="Next Page"]')[0]['href'], raises a list index out of range error. If you remove the indexing from the while condition, the problem goes away:
while soup.select('a[title="Next Page"]'):
This change gives a result of 94 rows as of 2022-08-25.
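Putting it together, a minimal sketch of the loop with only that condition changed (everything else is taken from the question's code, so page counts and row totals depend on what the NFL site currently serves) might look like this:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.nfl.com/stats/player-stats/category/passing/2020/reg/all/passingyards/desc'
soup = bs(requests.get(url).content, 'html.parser')
data = pd.DataFrame()

# An empty selection ends the loop cleanly instead of raising IndexError
while soup.select('a[title="Next Page"]'):
    next_url = 'https://www.nfl.com' + soup.select('a[title="Next Page"]')[0]['href']
    df = pd.read_html(next_url)[0]   # read the stats table on the next page
    soup = bs(requests.get(next_url).content, 'html.parser')
    data = pd.concat([data, df])

print(data.shape)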