Unable to load tables with "Load more" options in a website using Python

Time:12-04

I need to scrape the full table from this site, which has a "Load more" option.

As of now, when I scrape it, I only get the rows that show up by default when the page loads.

import pandas as pd
import requests

URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
header = {'Accept-Language': "en-US,en;q=0.9",
          'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
          }

resp2 = requests.get(url=URL2, headers=header).text

tables2 = pd.read_html(resp2)
overview_table2 = tables2[0]
overview_table2
Player Name Team Matches Goals Time Played Unnamed: 5
0 Jorge Pereyra Diaz Mumbai City 9 6 538 Mins NaN
1 Cleiton Silva SC East Bengal 8 5 707 Mins NaN
2 Abdenasser El Khayati Chennaiyin FC 5 4 231 Mins NaN
3 Lallianzuala Chhangte Mumbai City 9 4 737 Mins NaN
4 Nandhakumar Sekar Odisha 8 4 673 Mins NaN
5 Ivan Kalyuzhnyi Kerala Blasters 7 4 428 Mins NaN
6 Bipin Singh Mumbai City 9 4 806 Mins NaN
7 Noah Sadaoui Goa 8 4 489 Mins NaN
8 Diego Mauricio Odisha 8 3 526 Mins NaN
9 Pedro Martin Odisha 8 3 263 Mins NaN
10 Dimitri Petratos ATK Mohun Bagan 6 3 517 Mins NaN
11 Petar Sliskovic Chennaiyin FC 8 3 662 Mins NaN
12 Holicharan Narzary Hyderabad 9 3 705 Mins NaN
13 Dimitrios Diamantakos Kerala Blasters 7 3 529 Mins NaN
14 Alberto Noguera Mumbai City 9 3 371 Mins NaN
15 Jerry Mawihmingthanga Odisha 8 3 611 Mins NaN
16 Hugo Boumous ATK Mohun Bagan 7 2 580 Mins NaN
17 Javi Hernandez Bengaluru 6 2 397 Mins NaN
18 Borja Herrera Hyderabad 9 2 314 Mins NaN
19 Mohammad Yasir Hyderabad 9 2 777 Mins NaN
20 Load More.... Load More.... Load More.... Load More.... Load More.... Load More....

But I need the full table, including the data hidden behind "Load more". Please help.

CodePudding user response:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}


def main(url):
    # Query parameters for the site's XHR endpoint (visible in the browser's
    # network tab); limit=300 fetches every player in a single request.
    params = {
        "action": "stats",
        "league_id": "750",
        "limit": "300",
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview"
    }
    r = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    # Each player row starts with an <a class="player_link">; take its title
    # plus the next four cells (team, matches, goals, time played).
    goal = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    df = pd.DataFrame(
        goal, columns=['Name', 'Team', 'Matches', 'Goals', 'Time Played'])
    print(df)


main('https://www.mykhel.com/src/index.php')

Output:

                      Name              Team Matches Goals Time Played
0       Jorge Pereyra Diaz       Mumbai City       9     6    538 Mins
1            Cleiton Silva    SC East Bengal       8     5    707 Mins
2    Abdenasser El Khayati     Chennaiyin FC       5     4    231 Mins
3    Lallianzuala Chhangte       Mumbai City       9     4    737 Mins
4        Nandhakumar Sekar            Odisha       8     4    673 Mins
..                     ...               ...     ...   ...         ...
268          Sarthak Golui    SC East Bengal       6     0    402 Mins
269          Ivan Gonzalez    SC East Bengal       8     0    683 Mins
270       Michael Jakobsen  NorthEast United       8     0    676 Mins
271       Pratik Chowdhary     Jamshedpur FC       6     0    495 Mins
272         Chungnunga Lal    SC East Bengal       8     0    720 Mins

[273 rows x 5 columns]
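If the endpoint ever caps `limit`, the same request can presumably be paged by stepping `offset` instead. This is a sketch built on the answer above; the parameter semantics (`limit` as page size, `offset` as starting row) are assumptions inferred from their names.

```python
def page_params(offset, limit=100):
    """Query string for one page of player stats (values from the answer above)."""
    return {
        "action": "stats",
        "league_id": "750",
        "limit": str(limit),
        "offset": str(offset),
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview",
    }


def fetch_all(step=100):
    # Imported lazily so the sketch can be read without the scraping deps installed.
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0"}
    rows, offset = [], 0
    while True:
        r = requests.get("https://www.mykhel.com/src/index.php",
                         headers=headers, params=page_params(offset, step))
        soup = BeautifulSoup(r.text, "lxml")
        batch = [(a["title"], *[td.get_text(strip=True)
                                for td in a.find_all_next("td", limit=4)])
                 for a in soup.select("a.player_link")]
        if not batch:
            break          # an empty page means we are past the end
        rows.extend(batch)
        offset += step
    return pd.DataFrame(rows, columns=["Name", "Team", "Matches",
                                       "Goals", "Time Played"])
```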

CodePudding user response:

This is a dynamically loaded page, so you cannot parse all of its contents without clicking the button. (It can also be done by replaying the underlying XHR request, as the other answer here shows.)

I'll stick to handling dynamically loaded pages with the Selenium browser-automation suite.

Installation

To get started, you'll need to install the Selenium Python bindings:

pip install selenium

You seem to already have BeautifulSoup, but for anyone who might come across this answer, we'll also need it along with html5lib to parse the table later:

pip install html5lib BeautifulSoup4

Now, for Selenium to work you'll need a driver installed for the browser of your choice. To get a driver you can use Selenium Manager, driver-management software, or download one manually. The first two options are relatively new; I've had my manually downloaded drivers for ages, so I'll stick with those. Here are the download links:

Browser Link to driver download
Chrome: https://sites.google.com/chromium.org/driver/
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Opera: https://github.com/operasoftware/operachromiumdriver/releases

You can use any browser: Brave, Yandex Browser, basically any Chromium-based browser of your choice, or even the Tor Browser.
