How do you read HTML in Pandas with a dynamic table with no updating URL?

Time:11-14

I'm fetching data from https://www.wowprogress.com/ using Pandas. I read the HTML into a dataframe and counted the tables on the page. The table I want is the first one, which shows rows 1 through 20 and continues on the following pages.

The issue is that there's a "next" button on the page that you can press... but the URL doesn't change at all.

The code I used is below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

table_wow = pd.read_html('https://www.wowprogress.com/')
print (table_wow)

This shows the first table on the page from my end. But I cannot figure out how to simulate pressing the next button and getting the rest of the data on pages 2 through whatever page I want.

Any tips on how this can be done, or what I may be missing?

CodePudding user response:

When checking network activity you can see that the next page is loaded from https://www.wowprogress.com/pve/rating/next/0/rating/, with the integer after /next/ increasing with the page numbers. So you can loop through the subsequent pages:

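import pandas as pd
import time

# The first page comes from the home page; the ranking table is at index 1.
tables = [pd.read_html('https://www.wowprogress.com/')[1]]

max_page = 10

for i in range(0, max_page):
    # Later pages are served from the /next/<i>/ URL seen in the network tab.
    tables.append(pd.read_html(f'https://www.wowprogress.com/pve/rating/next/{i}/rating/')[1])
    time.sleep(1.5)  # pause between requests

# DataFrame.append was removed in pandas 2.0, so concatenate the pages instead.
table_wow = pd.concat(tables, ignore_index=True)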

CodePudding user response:

Here is a working example where the pagination is done through the API URL:

import requests
import pandas as pd

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}

api_url = ['https://www.wowprogress.com/pve/rating/next/' + str(x) + '/rating' for x in range(1,5)]

for url in api_url:
    req = requests.get(url,headers=headers)

    wiki_table = pd.read_html(req.text, attrs = {"class":"rating"} )

    df = wiki_table[0]#.to_csv('score.csv',index = False)

    print(df)
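
Both answers fetch one page at a time, and the second one only prints each page. A minimal sketch of collecting the pages into a single DataFrame could look like the following. The names fetch_rating_table and collect_pages are hypothetical helpers, not part of pandas or requests, and the shortened User-Agent and the timeout value are assumptions. The "rating" table class is taken from the answer above.

```python
import time
from io import StringIO

import pandas as pd

HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumption: a shortened User-Agent is enough


def fetch_rating_table(url):
    """Download one page and return its table with class "rating"."""
    import requests  # imported lazily so collect_pages can be used without it

    resp = requests.get(url, headers=HEADERS, timeout=30)  # timeout is an assumption
    resp.raise_for_status()
    # Newer pandas versions deprecate passing a literal HTML string,
    # so the text is wrapped in StringIO.
    return pd.read_html(StringIO(resp.text), attrs={"class": "rating"})[0]


def collect_pages(urls, fetch=fetch_rating_table, delay=0.0):
    """Fetch every URL in order and stack the tables into one DataFrame."""
    frames = []
    for url in urls:
        frames.append(fetch(url))
        time.sleep(delay)  # optional pause between requests
    return pd.concat(frames, ignore_index=True)
```

With the api_url list from the answer above, this would be called as collect_pages(api_url, delay=1.5). The fetch parameter is there so the download step can be swapped out, for example with a stub when trying the code offline.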