I want to scrape a leaderboard on this website, but it shows only 25 entries at a time, and the URL doesn't change when you press the "next" button to load the next 25. I want to collect the "number of rescues" for every entry in the leaderboard so I can check whether it follows a Pareto distribution (meaning, for example, that the top 20% of all people account for 80% of all rescues).
I can get the first 25 entries without any problem like this:
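import requests
from bs4 import BeautifulSoup
url = 'https://fuelrats.com/leaderboard'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
# Each leaderboard row is a div with class "rt-tr-group" and role "rowgroup"
rows = soup.find_all('div', {'class': 'rt-tr-group', 'role': 'rowgroup'})
print(len(rows))  # number of row groups found in the fetched HTML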
After that, though, I don't know how to press "next" from Python and get the next 25 entries. How can I do that? Is it even possible with just requests and bs4?
Answer:
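There is another way to get that data: query the API endpoint that the page is hydrated from. You can find the API URL in the browser's Dev Tools, in the Network tab under the 'XHR' section. The request below sets page[limit] to 5000 so that all entries come back in a single response: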
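import requests
import pandas as pd
# page[offset] and page[limit] control paging; a limit of 5000 covers the whole leaderboard
r = requests.get('https://fuelrats.com/api/fr/leaderboard?page[offset]=0&page[limit]=5000')
r.raise_for_status()  # stop early on an HTTP error
# Flatten the nested JSON records into columns such as attributes.rescueCount
df = pd.json_normalize(r.json()['data'])
print(df)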
Result in terminal:
type id attributes.preferredName attributes.ratNames attributes.joinedAt attributes.rescueCount attributes.codeRedCount attributes.isDispatch attributes.isEpic links.self
0 leaderboard-entries e9520722-02d2-4d69-9dba-c4e3ea727b14 Aleethia [Aanyath, Aleethia, Konisho, Ravenov] 2015-07-28T00:06:53.000Z 15467 2228 False False https://api.fuelrats.com/leaderboard-entries/e...
1 leaderboard-entries 5ed94356-bdcc-4139-9208-3cec320d51c9 Elysiumchains [Alysianfolly, Elysianfields, Elysiumchains, E... 2018-06-22T00:07:59.000Z 8476 810 True False https://api.fuelrats.com/leaderboard-entries/5...
2 leaderboard-entries bb1d04cd-2889-4994-ad6a-3fb881a5d243 Caleb Dume [Agamedes, Alcamenes, Andocides, Argonides, Ca... 2019-09-14T03:40:37.101Z 3163 309 True False https://api.fuelrats.com/leaderboard-entries/b...
3 leaderboard-entries 4a3ebe7a-35e6-4371-94f7-50db34c0167a Falcon JSDF [Falcon JSDF] 2016-03-01T02:44:51.265Z 2639 210 True False https://api.fuelrats.com/leaderboard-entries/4...
4 leaderboard-entries 9257c634-8c79-4d0c-b64c-f9acf1672f3a JERRYCLARK [JERRYCLARK] 2017-05-26T15:58:34.000Z 2452 229 False False https://api.fuelrats.com/leaderboard-entries/9...
... ... ... ... ... ... ... ... ... ... ...
3455 leaderboard-entries 55e8f803-9005-4c99-a5c3-d8ea49ac365a Boonlike [Boonlike] 2020-11-28T20:52:22.250Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/5...
3456 leaderboard-entries 46da00f1-0ac4-475c-9fc9-9da81624a5cd gamerjackiechan2 [gamerjackiechan2] 2018-06-04T18:51:10.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/4...
3457 leaderboard-entries e1d78729-3813-451e-95ff-f64c76bce4ba DutchProjectz [DutchProjectz] 2015-09-06T02:46:26.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/e...
3458 leaderboard-entries 57417a4e-b7a9-4612-8d66-59b69b33447e dalam88 [dalam88] 2017-07-25T14:48:13.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/5...
3459 leaderboard-entries 233a183a-b270-441f-bd55-258511cd9541 Glyc [Glyc, Glyca94] 2021-03-22T19:36:24.121Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/2...
3460 rows × 10 columns
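For the Pareto check mentioned in the question, a minimal sketch could compute what share of all rescues the top 20% of entries account for. The function name `top_share` and the toy `sample` list are hypothetical and used only for illustration; they are not part of the API or of the real leaderboard data.

```python
def top_share(counts, fraction=0.2):
    """Return the share of the total contributed by the top `fraction` of entries."""
    ordered = sorted(counts, reverse=True)  # largest counts first
    if not ordered:
        raise ValueError("counts must not be empty")
    # Number of entries in the top group, at least one
    top_n = max(1, round(len(ordered) * fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Hypothetical toy data, NOT the real leaderboard values
sample = [50, 30, 5, 5, 4, 2, 1, 1, 1, 1]
print(f"Top 20% share: {top_share(sample):.0%}")  # prints "Top 20% share: 80%"
```

With the DataFrame from the answer, this would be called as top_share(df['attributes.rescueCount']). A result near 0.8 is consistent with the 80/20 rule of thumb, but it does not by itself show that the counts follow a Pareto distribution.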