Web scraping with Python and Pandas - Pagination


With this short code I can get data from the table:

import pandas as pd

df = pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)

df[0].to_csv('2023_I_M_800.csv')

I am trying to get the data from all pages, or from a given number of them, but since this website doesn't use ul or li elements for pagination I don't know exactly how to build it.

Any help or idea would be appreciated.

CodePudding user response:

Since the URL contains the page number, why not just make a loop and concat?

https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular

import pandas as pd

F, L = 1, 4  # first and last pages

dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)  # keep track of which page each row came from
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv') # <- uncomment this line to make a .csv

NB: You can still access each sub_df separately with key indexing, e.g. dico[2] for page 2.
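For example, a small illustration, assuming the loop above has already run with F, L = 1, 4:

# dico was filled by the loop above; each value is the table of one page
page_2 = dico[2]                              # DataFrame scraped from page 2
print(page_2.head())                          # first rows of that page
# page_2.to_csv('2023_I_M_800_page_2.csv')    # or save a single page on its own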

Output:

print(out)

     page_number  Rank  ...         Date Results Score
0              1     1  ...  22 JAN 2023          1230
1              1     2  ...  22 JAN 2023          1204
2              1     3  ...  29 JAN 2023          1204
3              1     4  ...  27 JAN 2023          1192
4              1     5  ...  28 JAN 2023          1189
..           ...   ...  ...          ...           ...
395            4   394  ...  21 JAN 2023           977
396            4   394  ...  28 JAN 2023           977
397            4   398  ...  27 JAN 2023           977
398            4   399  ...  28 JAN 2023           977
399            4   399  ...  29 JAN 2023           977

[400 rows x 11 columns]
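
If the total number of pages is not known in advance, the same loop idea can be made open-ended and stop once a page no longer returns results. This is only a sketch: it assumes that a page past the end either contains no table (in which case pd.read_html raises ValueError) or returns an empty table, which you should verify against the real site:

import pandas as pd

BASE = ('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023'
        '?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular')
MAX_PAGES = 200  # hard safety cap so the loop can never run forever

frames = []
for page in range(1, MAX_PAGES + 1):
    try:
        sub_df = pd.read_html(BASE.format(page=page), parse_dates=True)[0]
    except ValueError:   # "No tables found": we went past the last page
        break
    if sub_df.empty:     # an existing but empty page also ends the loop
        break
    sub_df.insert(0, "page_number", page)
    frames.append(sub_df)

out = pd.concat(frames, ignore_index=True)
# out.to_csv('2023_I_M_800_all_pages.csv')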

CodePudding user response:

Try this:

import pandas as pd

for page in range(1, 10):  # pages 1 to 9
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')  # one .csv file per page
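
Note that this writes a separate .csv file for each page; if you want a single combined file instead, collect the DataFrames in a list (or dict) and pd.concat them, as in the answer above.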