I am trying to get the links from all the pages on https://apexranked.com/. I tried using
url = 'https://apexranked.com/'
page = 1
while page != 121:
    url = f'https://apexranked.com/?page={page}'
    print(url)
    page = page + 1
However, if you click the page numbers, the URL doesn't change to something like https://apexranked.com/?page=number, the way it does on https://www.mlb.com/stats/?page=2. How would I go about accessing and getting the links from all pages if the page number never appears in the URL?
CodePudding user response:
The page does not reload when you click on page 2. Instead, it fires a GET request to the website's backend.
The request is being sent to : https://apexranked.com/wp-admin/admin-ajax.php
In addition, several query parameters are appended to that URL:
?action=get_player_data&page=3&total_pages=195&_=1657230896643
Parameters:
- action: since the endpoint can serve several purposes, you must indicate which action to perform. Almost certainly a mandatory parameter; don't omit it.
- page: the requested page (i.e. the index you're iterating over).
- total_pages: the total number of pages (maybe it can be omitted; otherwise you can scrape it from the main page).
- _: a Unix timestamp in milliseconds, most likely a cache-buster. Same idea as above: try omitting it and see what happens. Otherwise you can generate one easily with round(time.time() * 1000).
Once you get a response, it yields rendered HTML. You could also try setting an Accept: application/json header in the request to get JSON back instead, but that's a detail.
All this information wrapped up:
import requests
import time

url = "https://apexranked.com/wp-admin/admin-ajax.php"

# Issued from a previous scraping of the main page
total_pages = 195

params = {
    "total_pages": total_pages,
    "_": round(time.time() * 1000),
    "action": "get_player_data"
}

# Make sure to include all mandatory fields
headers = {
    ...
}

for k in range(1, total_pages + 1):
    params['page'] = k
    res = requests.get(url, headers=headers, params=params)
    # Make your thing :)
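Once you have each page's rendered HTML back, pulling the links out of it is a separate step. A minimal sketch using only the standard library's html.parser (so it works regardless of which parser you have installed; the sample markup at the bottom is invented for illustration and the real response's structure may differ):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return the href of every anchor in the given HTML fragment."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Invented sample row, just to show the shape of the output
sample = '<tr><td class="table-player-name"><a href="/player/abc">abc</a></td></tr>'
print(extract_links(sample))  # ['/player/abc']
```

You would call extract_links(res.text) inside the loop above and accumulate the results.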
CodePudding user response:
Using the API (AJAX) url, you can scrape the desired data in a static way and handle the pagination easily with a for loop and the range function.
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://apexranked.com/wp-admin/admin-ajax.php?action=get_player_data&page={page}&total_pages=195&_=1657230965917"

lst = []
for page in range(1, 196):
    res = requests.get(url.format(page=page))
    soup = BeautifulSoup(res.content, "lxml")
    for row in soup.select('[] tbody tr'):
        name = row.select_one('.table-player-name a')
        name = name.get_text() if name else None
        lst.append({'name': name})

df = pd.DataFrame(lst)
print(df)
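As an aside, the long hard-coded query string above can also be assembled from a dict with the standard library, which makes it easier to change individual parameters (the parameter names here are the ones from the AJAX url above; the timestamp is left out to keep the output deterministic):

```python
from urllib.parse import urlencode

base = "https://apexranked.com/wp-admin/admin-ajax.php"
params = {
    "action": "get_player_data",
    "page": 3,
    "total_pages": 195,
}

# urlencode preserves dict insertion order (Python 3.7+)
url = f"{base}?{urlencode(params)}"
print(url)
# https://apexranked.com/wp-admin/admin-ajax.php?action=get_player_data&page=3&total_pages=195
```

This is equivalent to passing params= to requests.get, which does the same encoding for you.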
CodePudding user response:
I don't really get what you mean, but if, for example, you want the raw text, you can do it with requests:
import requests

page = 1
# A loop that will keep going until the page is not found.
while requests.get(f"https://apexranked.com/?page={page}").status_code != 404:
    # scrape content, e.g. the whole page
    print(requests.get(f"https://apexranked.com/?page={page}").text)
    page = page + 1
You can also replace print(requests.get(f"https://apexranked.com/?page={page}").text) with whatever method you want to use to get the data.
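One refinement on the loop above: it fetches every page twice (once for the status check, once for the text). A sketch that requests each page only once, written as a generator with the HTTP call passed in so the pagination logic can be tested without hitting the network (pages_until_404 is a name invented here, not part of requests):

```python
def pages_until_404(base_url, fetch, start=1):
    """Yield the body of each page until the server answers 404.

    `fetch` is any callable with the shape of requests.get, e.g.:
        import requests
        for body in pages_until_404("https://apexranked.com/", requests.get):
            print(body)
    """
    page = start
    while True:
        res = fetch(f"{base_url}?page={page}")
        if res.status_code == 404:
            break
        yield res.text
        page += 1
```

Note this only works if the site actually returns a 404 once you run past the last page; some sites return an empty 200 instead, in which case you'd stop on an empty body rather than the status code.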