Web scraping with Python and Pandas - Pagination


With this short code I can get data from the table:

import pandas as pd

df = pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)

df[0].to_csv('2023_I_M_800.csv')

I am trying to get the data from all pages, or from a given number of them, but since this website doesn't use ul or li elements for pagination I don't know exactly how to build it.

Any help or idea would be appreciated.

CodePudding user response:

Since the URL contains the page number, why not just make a loop and concat?

https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular

import pandas as pd

F, L = 1, 4  # first and last pages

dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)  # keep track of which page each row came from
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv') # <- uncomment this line to make a .csv

NB: You can still access each sub_df separately with key indexing, e.g. dico[2] for page 2.
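For example, a small illustration, assuming the loop above has already run with F, L = 1, 4:

# dico was filled by the loop above; each value is the table of one page
page_2 = dico[2]                              # DataFrame scraped from page 2
print(page_2.head())                          # first rows of that page
# page_2.to_csv('2023_I_M_800_page_2.csv')    # or save a single page on its own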

Output:

print(out)

     page_number  Rank  ...         Date Results Score
0              1     1  ...  22 JAN 2023          1230
1              1     2  ...  22 JAN 2023          1204
2              1     3  ...  29 JAN 2023          1204
3              1     4  ...  27 JAN 2023          1192
4              1     5  ...  28 JAN 2023          1189
..           ...   ...  ...          ...           ...
395            4   394  ...  21 JAN 2023           977
396            4   394  ...  28 JAN 2023           977
397            4   398  ...  27 JAN 2023           977
398            4   399  ...  28 JAN 2023           977
399            4   399  ...  29 JAN 2023           977

[400 rows x 11 columns]
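
If the total number of pages is not known in advance, the same loop idea can be made open-ended and stop once a page no longer returns results. This is only a sketch: it assumes that a page past the end either contains no table (in which case pd.read_html raises ValueError) or returns an empty table, which you should verify against the real site:

import pandas as pd

BASE = ('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023'
        '?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular')
MAX_PAGES = 200  # hard safety cap so the loop can never run forever

frames = []
for page in range(1, MAX_PAGES + 1):
    try:
        sub_df = pd.read_html(BASE.format(page=page), parse_dates=True)[0]
    except ValueError:   # "No tables found": we went past the last page
        break
    if sub_df.empty:     # an existing but empty page also ends the loop
        break
    sub_df.insert(0, "page_number", page)
    frames.append(sub_df)

out = pd.concat(frames, ignore_index=True)
# out.to_csv('2023_I_M_800_all_pages.csv')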

CodePudding user response:

Try this:

import pandas as pd

for page in range(1, 10):  # pages 1 to 9
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')  # one .csv file per page
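
Note that this writes a separate .csv file for each page; if you want a single combined file instead, collect the DataFrames in a list (or dict) and pd.concat them, as in the answer above.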