How to get the next page on a website using requests and bs4

Time:01-27

I want to scrape a leaderboard on this website, but it only shows 25 entries at a time, and the URL doesn't change when you press the "next" button to load the next 25. My goal is to collect the "number of rescues" for every entry on the leaderboard so I can check whether it follows a Pareto distribution (meaning that, for example, the top 20% of all people are responsible for 80% of all rescues).

I can get the first 25 entries without any problem like this:

import requests
from bs4 import BeautifulSoup

url = 'https://fuelrats.com/leaderboard'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

# find_all is the current name for the older findAll alias
rows = soup.find_all('div', {'class': 'rt-tr-group', 'role': 'rowgroup'})

# Number of leaderboard rows found in the fetched HTML
print(len(rows))

But after that, I don't know how to press "next" from Python to get the next 25 entries. How can I do that? Is it even possible with just requests and bs4?

Answer:

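There is another way to get that data: query the API endpoint that the page is hydrated from. You can find the API URL in the browser's Dev Tools, in the Network tab under the 'XHR' filter. The endpoint takes page[offset] and page[limit] parameters, so a limit larger than the leaderboard (5000 here) returns every entry in a single request, with no need to press "next".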

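import requests
import pandas as pd

# One request with a limit larger than the leaderboard returns every entry
r = requests.get('https://fuelrats.com/api/fr/leaderboard?page[offset]=0&page[limit]=5000')
r.raise_for_status()  # stop early on an HTTP error
# Flatten the nested JSON records into columns such as attributes.rescueCount
df = pd.json_normalize(r.json()['data'])
print(df)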

Result in terminal:

      type                 id                                    attributes.preferredName  attributes.ratNames                                attributes.joinedAt       attributes.rescueCount  attributes.codeRedCount  attributes.isDispatch  attributes.isEpic  links.self
0     leaderboard-entries  e9520722-02d2-4d69-9dba-c4e3ea727b14  Aleethia                  [Aanyath, Aleethia, Konisho, Ravenov]              2015-07-28T00:06:53.000Z  15467                   2228                     False                  False              https://api.fuelrats.com/leaderboard-entries/e...
1     leaderboard-entries  5ed94356-bdcc-4139-9208-3cec320d51c9  Elysiumchains             [Alysianfolly, Elysianfields, Elysiumchains, E...  2018-06-22T00:07:59.000Z  8476                    810                      True                   False              https://api.fuelrats.com/leaderboard-entries/5...
2     leaderboard-entries  bb1d04cd-2889-4994-ad6a-3fb881a5d243  Caleb Dume                [Agamedes, Alcamenes, Andocides, Argonides, Ca...  2019-09-14T03:40:37.101Z  3163                    309                      True                   False              https://api.fuelrats.com/leaderboard-entries/b...
3     leaderboard-entries  4a3ebe7a-35e6-4371-94f7-50db34c0167a  Falcon JSDF               [Falcon JSDF]                                      2016-03-01T02:44:51.265Z  2639                    210                      True                   False              https://api.fuelrats.com/leaderboard-entries/4...
4     leaderboard-entries  9257c634-8c79-4d0c-b64c-f9acf1672f3a  JERRYCLARK                [JERRYCLARK]                                       2017-05-26T15:58:34.000Z  2452                    229                      False                  False              https://api.fuelrats.com/leaderboard-entries/9...
...   ...                  ...                                   ...                       ...                                                ...                       ...                     ...                      ...                    ...                ...
3455  leaderboard-entries  55e8f803-9005-4c99-a5c3-d8ea49ac365a  Boonlike                  [Boonlike]                                         2020-11-28T20:52:22.250Z  1                       0                        False                  False              https://api.fuelrats.com/leaderboard-entries/5...
3456  leaderboard-entries  46da00f1-0ac4-475c-9fc9-9da81624a5cd  gamerjackiechan2          [gamerjackiechan2]                                 2018-06-04T18:51:10.000Z  1                       0                        False                  False              https://api.fuelrats.com/leaderboard-entries/4...
3457  leaderboard-entries  e1d78729-3813-451e-95ff-f64c76bce4ba  DutchProjectz             [DutchProjectz]                                    2015-09-06T02:46:26.000Z  1                       0                        False                  False              https://api.fuelrats.com/leaderboard-entries/e...
3458  leaderboard-entries  57417a4e-b7a9-4612-8d66-59b69b33447e  dalam88                   [dalam88]                                          2017-07-25T14:48:13.000Z  1                       0                        False                  False              https://api.fuelrats.com/leaderboard-entries/5...
3459  leaderboard-entries  233a183a-b270-441f-bd55-258511cd9541  Glyc                      [Glyc, Glyca94]                                    2021-03-22T19:36:24.121Z  1                       0                        False                  False              https://api.fuelrats.com/leaderboard-entries/2...

3460 rows × 10 columns
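
If the server ever caps page[limit] below the size of the leaderboard, the same endpoint could be walked page by page instead. The following is only a sketch under assumptions: the function names fetch_page and fetch_all and the page size of 100 are hypothetical, and it assumes that a page shorter than the requested limit marks the end of the data.

```python
API_URL = "https://fuelrats.com/api/fr/leaderboard"  # endpoint from the answer above


def fetch_page(offset, limit):
    """Fetch one page of leaderboard entries as a list of dicts."""
    # Imported here so the paging logic below can be tried without requests
    import requests

    url = f"{API_URL}?page[offset]={offset}&page[limit]={limit}"
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return r.json()["data"]


def fetch_all(get_page=fetch_page, limit=100):
    """Collect entries page by page until a short or empty page arrives."""
    entries = []
    offset = 0
    while True:
        page = get_page(offset, limit)
        entries.extend(page)
        # Assumption: fewer rows than requested means this was the last page
        if len(page) < limit:
            break
        offset += limit
    return entries
```

The result of fetch_all() can be passed to pd.json_normalize in the same way as r.json()['data'] above. Passing the page getter as a parameter keeps the loop easy to try with a stand-in function that does not touch the network.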
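
With every rescue count in one column, the 80/20 question from the original post can be checked directly. This is a minimal sketch: top_share is a hypothetical helper, not part of pandas or of the site's API, and a result near 0.8 is only consistent with the 80/20 rule of thumb rather than proof of a Pareto distribution.

```python
def top_share(counts, fraction=0.2):
    """Return the share of the total contributed by the top `fraction` of entries."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("need at least one non-zero count")
    # Number of entries in the top group, never fewer than one
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


# Toy data: one of five people (20%) made 80 of 100 rescues
print(top_share([5, 80, 5, 5, 5]))  # 0.8
```

On the real data this would be called as top_share(df['attributes.rescueCount']), using the DataFrame built in the answer.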