I'm fetching data from https://www.wowprogress.com/ and am using Pandas to do it. I read the HTML into a dataframe, and counted the tables on the page. The table I want is the first table with indexes from 1 through 20, and so on.
The issue is that there's a "next" button on the page that you can press... but the URL doesn't change at all.
The code I used is below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize
table_wow = pd.read_html('https://www.wowprogress.com/')
print(table_wow)
This shows the first table on the page from my end. But I cannot figure out how to simulate pressing the next button and getting the rest of the data on pages 2 through whatever page I want.
Any tips on how this can be done, or what I may be missing?
CodePudding user response:
When checking network activity, you can see that the next page is loaded from https://www.wowprogress.com/pve/rating/next/0/rating/, with the integer after /next/ increasing with the page number. So you can loop through the subsequent pages:
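import time
import pandas as pd

# Start with the table from the landing page
tables = [pd.read_html('https://www.wowprogress.com/')[1]]
max_page = 10
for i in range(max_page):
    # Each subsequent page is served from /next/{i}/rating/
    table = pd.read_html(f'https://www.wowprogress.com/pve/rating/next/{i}/rating/')[1]
    tables.append(table)
    time.sleep(1.5)  # pause between requests
# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
table_wow = pd.concat(tables, ignore_index=True)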
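The URL pattern described above can also be separated from the fetching step, which makes the loop easier to test without hitting the site. This is only a sketch: the names `BASE_URL`, `build_page_urls` and `collect_pages` are hypothetical helpers, not part of pandas or of the site, and the URL template is just the one observed in the network tab.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

# URL template observed in the browser's network tab (assumed to stay stable)
BASE_URL = "https://www.wowprogress.com/pve/rating/next/{page}/rating/"


def build_page_urls(first_page: int, last_page: int) -> List[str]:
    """Return the URLs for pages first_page..last_page, inclusive."""
    return [BASE_URL.format(page=p) for p in range(first_page, last_page + 1)]


def collect_pages(urls: Iterable[str],
                  fetch: Callable[[str], T],
                  delay: float = 1.5) -> List[T]:
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results: List[T] = []
    for n, url in enumerate(urls):
        if n > 0:
            time.sleep(delay)  # be polite to the server between requests
        results.append(fetch(url))
    return results
```

With pandas, this could be used as `frames = collect_pages(build_page_urls(0, 9), lambda u: pd.read_html(u)[1])`, followed by `pd.concat(frames, ignore_index=True)`. The table index `[1]` is taken from the answer above.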
CodePudding user response:
Here is a working example where pagination is done through the API URL:
from io import StringIO
import requests
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
# Build one URL per page; the integer after /next/ is the page index
api_url = [f'https://www.wowprogress.com/pve/rating/next/{x}/rating' for x in range(1, 5)]
for url in api_url:
    req = requests.get(url, headers=headers)
    # Keep only tables whose class attribute is "rating";
    # StringIO avoids the literal-HTML deprecation warning in pandas 2.1+
    wiki_table = pd.read_html(StringIO(req.text), attrs={"class": "rating"})
    df = wiki_table[0]  # .to_csv('score.csv', index=False)
    print(df)