I am trying to scrape lol.fandom (cblol, Brazilian, stats or any stats really tbh) for a college project but all I get from this website is "NaN". I have no idea how to get around it.
I failed to locate any API, also. Can someone help me to scrape this? If I have to use another language, no problem, I can learn it. If I shouldn't be scraping this website, how can I know in the future a sign that "the website doesn't want to be scraped"?
Code from "Match_History":
import requests
from bs4 import BeautifulSoup
import pandas as pd
cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2/Match_History"
data = requests.get(cblol_url)
soup = BeautifulSoup(data.text)
cblol_table = soup.select('table.wikitable')
matches_cblol = pd.read_html(data.text, match="Tournament") [0]
matches_cblol
Result - a bunch of NaNs:
Tournament: CBLOL/2022 Season/Split 2; Limit: 200 - Open As Query Date P Blue Red Winner Bans Bans.1 Picks Picks.1 Blue Roster Red Roster SB VOD 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 2022-08-07 12.14 NaN NaN NaN NaN Robo, Croc, tinowns, Brance, Ceos fNb, Goot, Envy, Netuno, RedBert SB VOD 2 2022-08-07 12.14 NaN NaN NaN NaN KiaRi, Disamis, Krastyel, Matsukaze, Cavalo Zecas, Erasus, evr0t, NinjaKiwi, Mido SB VOD 3 2022-08-07 12.14 NaN NaN NaN NaN Hidan, Yampi, NOsFerus, micaO, Jockster GUIGO, Aegis, Grevthar, TitaN, JoJo SB VOD 4 2022-08-07 12.14 NaN NaN NaN NaN Tay, Ranger, Tutsz, Flare, Wos Wizer, CarioK, dyNquedo, Trigo, Damage SB VOD ... ... ... ... ... ... ... ... ... ... ... ... ... ... 86 2022-06-11 12.10 NaN NaN NaN NaN Tay, Geum go, Tutsz, Flare, Kuri Robo, Croc, tinowns, Brance, Ceos SB VOD 87 2022-06-11 12.10 NaN NaN NaN NaN fNb, Goot, Envy, Netuno, RedBert Parang, Wiz, hauz, DudsTheBoy, Scuro SB VOD 88 2022-06-11 12.10 NaN NaN NaN NaN KiaRi, Disamis, Krastyel, Matsukaze, Cavalo DoRun, Hugato, Anyyy, Celo, Sive SB VOD 89 2022-06-11 12.10 NaN NaN NaN NaN Hidan, Yampi, NOsFerus, micaO, Jockster Trap, Minerva, Goku, NinjaKiwi, Mocha SB VOD 90 2022-06-11 12.10 NaN NaN NaN NaN Wizer, CarioK, dyNquedo, Trigo, Damage GUIGO, Aegis, Grevthar, TitaN, JoJo SB VOD 91 rows × 13 columns
I also tried scraping other links to see if it was just a "Match_History" problem but when I tried to scrape "wiki/CBLOL/2022_Season/Split_2", for example, a more general view on the tournament than "wiki/CBLOL/2022_Season/Split_2/Match_History:
from bs4 import BeautifulSoup
import requests
cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2"
data = requests.get(cblol_url)
soup = BeautifulSoup(data.text)
cblol_table2 = soup.select('table.wikitable')
cblol_table2[0:8]
This was the furthest I could get from this URL. I can't get pandas to show me a table after this steps if my life depended on it.
Please help.
CodePudding user response:
I hope I can help you.
But your code ran perfectly in my notebook, so the problem is something else, perhaps any of the libraries not installed properly: BeautifulSoup
, pandas
, or requests
print(matches_cblol)
Tournament: CBLOL/2022 Season/Split 2; Limit: 200 - Open As Query
Date P Blue Red Winner Bans Bans.1 Picks Picks.1 Blue Roster Red Roster SB VOD
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2022-08-07 12.14 NaN NaN NaN NaN Robo, Croc, tinowns, Brance, Ceos fNb, Goot, Envy, Netuno, RedBert SB VOD
2 2022-08-07 12.14 NaN NaN NaN NaN KiaRi, Disamis, Krastyel, Matsukaze, Cavalo Zecas, Erasus, evr0t, NinjaKiwi, Mido SB VOD
3 2022-08-07 12.14 NaN NaN NaN NaN Hidan, Yampi, NOsFerus, micaO, Jockster GUIGO, Aegis, Grevthar, TitaN, JoJo SB VOD
4 2022-08-07 12.14 NaN NaN NaN NaN Tay, Ranger, Tutsz, Flare, Wos Wizer, CarioK, dyNquedo, Trigo, Damage SB VOD
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
86 2022-06-11 12.10 NaN NaN NaN NaN Tay, Geum go, Tutsz, Flare, Kuri Robo, Croc, tinowns, Brance, Ceos SB VOD
87 2022-06-11 12.10 NaN NaN NaN NaN fNb, Goot, Envy, Netuno, RedBert Parang, Wiz, hauz, DudsTheBoy, Scuro SB VOD
88 2022-06-11 12.10 NaN NaN NaN NaN KiaRi, Disamis, Krastyel, Matsukaze, Cavalo DoRun, Hugato, Anyyy, Celo, Sive SB VOD
89 2022-06-11 12.10 NaN NaN NaN NaN Hidan, Yampi, NOsFerus, micaO, Jockster Trap, Minerva, Goku, NinjaKiwi, Mocha SB VOD
90 2022-06-11 12.10 NaN NaN NaN NaN Wizer, CarioK, dyNquedo, Trigo, Damage GUIGO, Aegis, Grevthar, TitaN, JoJo SB VOD
[91 rows x 13 columns]
I also tried the second code, and the output is a dataframe with 20 rows:
cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2"
data = requests.get(cblol_url)
soup = BeautifulSoup(data.text)
cblol_table2 = soup.select('table.wikitable')
cblol_table2 = pd.read_html(data.text, match="Tournament")[0]
print(cblol_table2)
CBLOL 2022 Split 2 CBLOL 2022 Split 2.1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 Tournament Information Tournament Information
5 Organizer Riot Games
6 Rulebook Rulebook
7 Format Round Robin
8 Location & Dates Location & Dates
9 Region BRBrazil
10 Event Type Online
11 Country Brazil
12 Start Date 2022-06-11
13 End Date 2022-08-07
14 Broadcast Broadcast
15 Streams Twitch YouTube NimoTV
16 Schedule Spoiler-Free ScheduleExport to Google Calendar
17 Social Media & Links Social Media & Links
18 NaN NaN
19 NaN NaN
CodePudding user response:
Hmmm... I was not clear about my problem. Sorry.
Yes, the code is running perfectly, but no relevant data is coming out of it
(Nothing like the tutorials or examples I've looked into).
I don't know what direction I should take to solve it and get relevant data from this website.
Example of Relevant Data: "Blue Side" should have the name of a team instead of "NaN", "Red Side" same thing, and "Winner" the same thing.
"Bans" and "Picks" might be trickier because there are a lot of "Champions" in the game so I was hoping I could get at least the blue/red/winner team info out of this table.
Should I try Selenium, R Script or other stuff?
EDIT: I made some progress
import requests
from bs4 import BeautifulSoup
import pandas as pd
cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2"
data = requests.get(cblol_url)
soup = BeautifulSoup(data.text)
cblol_table = soup.select('#md-table')
cblol_table = pd.read_html(data.text, match="Score")[0]
cblol_table.columns = cblol_table.columns.droplevel(0)
Here I analyzed using .dtypes and the table "Score" is an object due to the nature "1 - 0". Apparently, I need this value to be a float or a string.
So I tried splitting the columns with pandas and I end up getting a giant error that ends with "KeyError: 'Score'.
cblol_table['Score'].str.split(' - ', expand=True)
How do I proceed from here?