I need to scrape the full table from this site, which has a "Load more" option.
Right now, when I scrape it, I only get the rows that are shown by default when the page loads.
import pandas as pd
import requests
URL2 = "https://www.mykhel.com/football/indian-super-league-player-stats-l750/"
header = {
    'Accept-Language': "en-US,en;q=0.9",
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
}
resp2 = requests.get(url=URL2, headers=header).text
tables2 = pd.read_html(resp2)
overview_table2 = tables2[0]
overview_table2
|  | Player Name | Team | Matches | Goals | Time Played | Unnamed: 5 |
|---|---|---|---|---|---|---|
0 | Jorge Pereyra Diaz | Mumbai City | 9 | 6 | 538 Mins | NaN |
1 | Cleiton Silva | SC East Bengal | 8 | 5 | 707 Mins | NaN |
2 | Abdenasser El Khayati | Chennaiyin FC | 5 | 4 | 231 Mins | NaN |
3 | Lallianzuala Chhangte | Mumbai City | 9 | 4 | 737 Mins | NaN |
4 | Nandhakumar Sekar | Odisha | 8 | 4 | 673 Mins | NaN |
5 | Ivan Kalyuzhnyi | Kerala Blasters | 7 | 4 | 428 Mins | NaN |
6 | Bipin Singh | Mumbai City | 9 | 4 | 806 Mins | NaN |
7 | Noah Sadaoui | Goa | 8 | 4 | 489 Mins | NaN |
8 | Diego Mauricio | Odisha | 8 | 3 | 526 Mins | NaN |
9 | Pedro Martin | Odisha | 8 | 3 | 263 Mins | NaN |
10 | Dimitri Petratos | ATK Mohun Bagan | 6 | 3 | 517 Mins | NaN |
11 | Petar Sliskovic | Chennaiyin FC | 8 | 3 | 662 Mins | NaN |
12 | Holicharan Narzary | Hyderabad | 9 | 3 | 705 Mins | NaN |
13 | Dimitrios Diamantakos | Kerala Blasters | 7 | 3 | 529 Mins | NaN |
14 | Alberto Noguera | Mumbai City | 9 | 3 | 371 Mins | NaN |
15 | Jerry Mawihmingthanga | Odisha | 8 | 3 | 611 Mins | NaN |
16 | Hugo Boumous | ATK Mohun Bagan | 7 | 2 | 580 Mins | NaN |
17 | Javi Hernandez | Bengaluru | 6 | 2 | 397 Mins | NaN |
18 | Borja Herrera | Hyderabad | 9 | 2 | 314 Mins | NaN |
19 | Mohammad Yasir | Hyderabad | 9 | 2 | 777 Mins | NaN |
20 | Load More.... | Load More.... | Load More.... | Load More.... | Load More.... | Load More.... |
But I need the full table, including the data behind "Load more". Please help.
CodePudding user response:
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}
def main(url):
    # Query parameters for the site's stats endpoint; the limit is set
    # high enough to return every row in a single response.
    params = {
        "action": "stats",
        "league_id": "750",
        "limit": "300",
        "offset": "0",
        "part": "leagues",
        "season_id": "2022",
        "section": "football",
        "stats_type": "player",
        "tab": "overview"
    }
    r = requests.get(url, headers=headers, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    # For each player link, take its title plus the text of the next four <td> cells.
    goal = [(x['title'], *[i.get_text(strip=True) for i in x.find_all_next('td', limit=4)])
            for x in soup.select('a.player_link')]
    df = pd.DataFrame(
        goal, columns=['Name', 'Team', 'Matches', 'Goals', 'Time Played'])
    print(df)
main('https://www.mykhel.com/src/index.php')
Output:
Name Team Matches Goals Time Played
0 Jorge Pereyra Diaz Mumbai City 9 6 538 Mins
1 Cleiton Silva SC East Bengal 8 5 707 Mins
2 Abdenasser El Khayati Chennaiyin FC 5 4 231 Mins
3 Lallianzuala Chhangte Mumbai City 9 4 737 Mins
4 Nandhakumar Sekar Odisha 8 4 673 Mins
.. ... ... ... ... ...
268 Sarthak Golui SC East Bengal 6 0 402 Mins
269 Ivan Gonzalez SC East Bengal 8 0 683 Mins
270 Michael Jakobsen NorthEast United 8 0 676 Mins
271 Pratik Chowdhary Jamshedpur FC 6 0 495 Mins
272 Chungnunga Lal SC East Bengal 8 0 720 Mins
[273 rows x 5 columns]
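The request above gets everything in one call by setting limit to 300. If an endpoint like this caps the page size, the same idea can be applied page by page using offset. The following is only a minimal sketch: `fetch_all_rows`, `fetch_page`, `page_size` and `max_pages` are hypothetical names, and whether this particular endpoint honours offset the way the sketch expects is an assumption that has not been verified.

```python
from typing import Callable, List, Sequence


def fetch_all_rows(
    fetch_page: Callable[[int, int], Sequence],
    page_size: int = 50,
    max_pages: int = 100,
) -> List:
    """Collect rows page by page until the source runs dry.

    fetch_page(offset, limit) is a hypothetical callable that returns the
    rows for one page. In the answer above it would wrap requests.get(...)
    with params["offset"] and params["limit"] set, then parse the response.
    """
    rows: List = []
    for page in range(max_pages):  # hard stop so a misbehaving endpoint cannot loop forever
        batch = fetch_page(page * page_size, page_size)
        if not batch:
            break  # an empty page means there is nothing left
        rows.extend(batch)
        if len(batch) < page_size:
            break  # a short page is the last page
    return rows


if __name__ == "__main__":
    # Stand-in data source so the sketch runs without network access.
    fake_rows = [f"player {i}" for i in range(273)]
    result = fetch_all_rows(lambda offset, limit: fake_rows[offset:offset + limit])
    print(len(result))
```

Passing the fetcher in as a callable keeps the paging logic separate from the HTTP and parsing code, so it can be tried out against fake data first.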
CodePudding user response:
This is a dynamically loaded page, so you cannot parse all of its contents without clicking a button.
Well… maybe you can by calling the underlying XHR endpoint directly or something like that; maybe someone will contribute an answer along those lines.
I'll stick to handling dynamically loaded pages with the Selenium browser automation suite.
Installation
To get started, you'll need to install the Selenium bindings:
pip install selenium
You seem to already have BeautifulSoup, but for anyone else who comes across this answer: we'll also need it and html5lib later to parse the table:
pip install html5lib BeautifulSoup4
Now, for Selenium to work you'll need a driver installed for the browser of your choice. To get the drivers you can use Selenium Manager, driver management software, or download them manually. The first two options are fairly new; I've had my manually downloaded drivers for ages, so I'll stick with them. For convenience, here are the download links:
Browser | Link to driver download |
---|---|
Chrome: | https://sites.google.com/chromium.org/driver/ |
Edge: | https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/ |
Firefox: | https://github.com/mozilla/geckodriver/releases |
Safari: | https://webkit.org/blog/6900/webdriver-support-in-safari-10/ |
Opera: | https://github.com/operasoftware/operachromiumdriver/releases |
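After downloading a driver manually, it can help to check that its executable is actually reachable on your PATH before starting Selenium. Here is a rough sketch using only the standard library; `find_drivers` and `DRIVER_EXECUTABLES` are names made up for this example, and the executable names listed are the usual defaults, which is an assumption (yours may differ if you renamed the file).

```python
import shutil
from typing import Dict, Mapping, Optional

# Usual executable names for each driver (assumed defaults, adjust if renamed).
DRIVER_EXECUTABLES = {
    "Chrome": "chromedriver",
    "Edge": "msedgedriver",
    "Firefox": "geckodriver",
    "Safari": "safaridriver",
    "Opera": "operadriver",
}


def find_drivers(
    executables: Mapping[str, str] = DRIVER_EXECUTABLES,
) -> Dict[str, Optional[str]]:
    """Map each browser to the full path of its driver, or None if not on PATH."""
    return {browser: shutil.which(exe) for browser, exe in executables.items()}


if __name__ == "__main__":
    for browser, path in find_drivers().items():
        print(f"{browser}: {path or 'not found on PATH'}")
```

If a driver shows up as not found, either add its folder to PATH or pass the driver's full path to Selenium explicitly.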
You can use any browser, e.g. Brave, Yandex Browser, basically any Chromium-based browser of your choice, or even Tor Browser