Is it impossible to scrape LOL.FANDOM?

I am trying to scrape lol.fandom (CBLOL, the Brazilian league, or really any stats) for a college project, but all I get from this website is "NaN". I have no idea how to get around it.

I also failed to locate any API. Can someone help me scrape this? If I have to use another language, no problem, I can learn it. If I shouldn't be scraping this website, what signs should I look for in the future that a website doesn't want to be scraped?
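On the last point, one common signal is the site's robots.txt file, which lists the paths crawlers are asked to avoid; the site's terms of use are worth reading too. Below is a minimal sketch of checking such rules with Python's standard library. The robots.txt content, the example.com URLs, and the user-agent name are all made up for illustration so the example runs offline; a real check would read the actual file from the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
# A real check would load the site's own /robots.txt instead.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /wiki/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# "my-college-project" is a made-up user-agent name.
wiki_allowed = parser.can_fetch("my-college-project", "https://example.com/wiki/Some_Page")
private_allowed = parser.can_fetch("my-college-project", "https://example.com/private/data")

print("wiki page allowed:", wiki_allowed)
print("private page allowed:", private_allowed)
```

To check a live site, `set_url()` followed by `read()` on the same parser object fetches the file over the network instead of parsing a string.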

Code from "Match_History":

import requests
from bs4 import BeautifulSoup
import pandas as pd

cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2/Match_History"
data = requests.get(cblol_url)
# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(data.text, "html.parser")

cblol_table = soup.select('table.wikitable')
matches_cblol = pd.read_html(data.text, match="Tournament")[0]
matches_cblol

Result - a bunch of NaNs:

Tournament: CBLOL/2022 Season/Split 2; Limit: 200 - Open As Query
          Date      P  Blue  Red  Winner  Bans  Bans.1  Picks  Picks.1                                  Blue Roster                              Red Roster   SB  VOD
0          NaN    NaN   NaN  NaN     NaN   NaN     NaN    NaN      NaN                                          NaN                                     NaN  NaN  NaN
1   2022-08-07  12.14      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN            Robo, Croc, tinowns, Brance, Ceos        fNb, Goot, Envy, Netuno, RedBert   SB  VOD
2   2022-08-07  12.14      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN  KiaRi, Disamis, Krastyel, Matsukaze, Cavalo   Zecas, Erasus, evr0t, NinjaKiwi, Mido   SB  VOD
3   2022-08-07  12.14      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN      Hidan, Yampi, NOsFerus, micaO, Jockster     GUIGO, Aegis, Grevthar, TitaN, JoJo   SB  VOD
4   2022-08-07  12.14      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN               Tay, Ranger, Tutsz, Flare, Wos  Wizer, CarioK, dyNquedo, Trigo, Damage   SB  VOD
..         ...    ...   ...  ...     ...   ...     ...    ...      ...                                          ...                                     ...  ...  ...
86  2022-06-11  12.10      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN             Tay, Geum go, Tutsz, Flare, Kuri       Robo, Croc, tinowns, Brance, Ceos   SB  VOD
87  2022-06-11  12.10      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN             fNb, Goot, Envy, Netuno, RedBert    Parang, Wiz, hauz, DudsTheBoy, Scuro   SB  VOD
88  2022-06-11  12.10      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN  KiaRi, Disamis, Krastyel, Matsukaze, Cavalo        DoRun, Hugato, Anyyy, Celo, Sive   SB  VOD
89  2022-06-11  12.10      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN      Hidan, Yampi, NOsFerus, micaO, Jockster   Trap, Minerva, Goku, NinjaKiwi, Mocha   SB  VOD
90  2022-06-11  12.10      ⁠⁠     ⁠⁠        ⁠⁠   NaN     NaN    NaN      NaN       Wizer, CarioK, dyNquedo, Trigo, Damage     GUIGO, Aegis, Grevthar, TitaN, JoJo   SB  VOD

91 rows × 13 columns

I also tried scraping other links to see if it was just a "Match_History" problem. For example, I tried "wiki/CBLOL/2022_Season/Split_2", a more general view of the tournament than "wiki/CBLOL/2022_Season/Split_2/Match_History":

from bs4 import BeautifulSoup
import requests

cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2"
data = requests.get(cblol_url)
# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(data.text, "html.parser")

cblol_table2 = soup.select('table.wikitable')
cblol_table2[0:8]

This was the furthest I could get with this URL. I couldn't get pandas to show me a table after these steps if my life depended on it.

Please help.

CodePudding user response：

I hope I can help.
Your code ran perfectly in my notebook, so the problem must be something else; perhaps one of the libraries (BeautifulSoup, pandas, or requests) is not installed properly.

print(matches_cblol)

   Tournament: CBLOL/2022 Season/Split 2; Limit: 200 - Open As Query                                                                                                                                                 
                                                                Date      P Blue  Red Winner Bans Bans.1 Picks Picks.1                                  Blue Roster                              Red Roster   SB  VOD
0                                                                NaN    NaN  NaN  NaN    NaN  NaN    NaN   NaN     NaN                                          NaN                                     NaN  NaN  NaN
1                                                         2022-08-07  12.14   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN            Robo, Croc, tinowns, Brance, Ceos        fNb, Goot, Envy, Netuno, RedBert   SB  VOD
2                                                         2022-08-07  12.14   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN  KiaRi, Disamis, Krastyel, Matsukaze, Cavalo   Zecas, Erasus, evr0t, NinjaKiwi, Mido   SB  VOD
3                                                         2022-08-07  12.14   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN      Hidan, Yampi, NOsFerus, micaO, Jockster     GUIGO, Aegis, Grevthar, TitaN, JoJo   SB  VOD
4                                                         2022-08-07  12.14   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN               Tay, Ranger, Tutsz, Flare, Wos  Wizer, CarioK, dyNquedo, Trigo, Damage   SB  VOD
..                                                               ...    ...  ...  ...    ...  ...    ...   ...     ...                                          ...                                     ...  ...  ...
86                                                        2022-06-11  12.10   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN             Tay, Geum go, Tutsz, Flare, Kuri       Robo, Croc, tinowns, Brance, Ceos   SB  VOD
87                                                        2022-06-11  12.10   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN             fNb, Goot, Envy, Netuno, RedBert    Parang, Wiz, hauz, DudsTheBoy, Scuro   SB  VOD
88                                                        2022-06-11  12.10   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN  KiaRi, Disamis, Krastyel, Matsukaze, Cavalo        DoRun, Hugato, Anyyy, Celo, Sive   SB  VOD
89                                                        2022-06-11  12.10   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN      Hidan, Yampi, NOsFerus, micaO, Jockster   Trap, Minerva, Goku, NinjaKiwi, Mocha   SB  VOD
90                                                        2022-06-11  12.10   ⁠⁠   ⁠⁠     ⁠⁠  NaN    NaN   NaN     NaN       Wizer, CarioK, dyNquedo, Trigo, Damage     GUIGO, Aegis, Grevthar, TitaN, JoJo   SB  VOD

[91 rows x 13 columns]

I also tried the second code, and the output is a dataframe with 20 rows:

cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2"
data = requests.get(cblol_url)
soup = BeautifulSoup(data.text, "html.parser")

cblol_table2 = soup.select('table.wikitable')
cblol_table2 = pd.read_html(data.text, match="Tournament")[0]
print(cblol_table2)

        CBLOL 2022 Split 2                            CBLOL 2022 Split 2.1
0                      NaN                                             NaN
1                      NaN                                             NaN
2                      NaN                                             NaN
3                      NaN                                             NaN
4   Tournament Information                          Tournament Information
5                Organizer                                      Riot Games
6                 Rulebook                                        Rulebook
7                   Format                                     Round Robin
8         Location & Dates                                Location & Dates
9                   Region                                        BRBrazil
10              Event Type                                          Online
11                 Country                                          Brazil
12              Start Date                                      2022-06-11
13                End Date                                      2022-08-07
14               Broadcast                                       Broadcast
15                 Streams                           Twitch YouTube NimoTV
16                Schedule  Spoiler-Free ScheduleExport to Google Calendar
17    Social Media & Links                            Social Media & Links
18                     NaN                                             NaN
19                     NaN                                             NaN

CodePudding user response：

Hmmm... I was not clear about my problem. Sorry.

Yes, the code runs perfectly, but no relevant data comes out of it (nothing like the tutorials or examples I've looked at).

I don't know what direction I should take to solve it and get relevant data from this website.

Example of relevant data: "Blue Side" should have the name of a team instead of "NaN", and the same goes for "Red Side" and "Winner".

"Bans" and "Picks" might be trickier because there are a lot of "Champions" in the game so I was hoping I could get at least the blue/red/winner team info out of this table.

Should I try Selenium, an R script, or something else?
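One possible explanation for the NaN values: pd.read_html only reads the text inside each cell, and the Blue, Red, and Winner cells appear to hold only team logos (plus some invisible characters), so there is no text to read. If the team name is stored in an attribute such as a link's title, which is an assumption worth checking in the browser's inspector, it can be pulled out with BeautifulSoup without needing Selenium. A minimal sketch, where SAMPLE_HTML is hypothetical markup standing in for the real page and cell_value and table_to_frame are illustrative helper names:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: a simplified stand-in for a table whose team cells
# contain only a logo. The real page's tags and attributes may differ.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Date</th><th>Blue</th><th>Red</th><th>Winner</th></tr>
  <tr>
    <td>2022-08-07</td>
    <td><a href="/wiki/Team_A" title="Team A"><img alt="Team A logo"></a>&#8288;&#8288;</td>
    <td><a href="/wiki/Team_B" title="Team B"><img alt="Team B logo"></a></td>
    <td><a href="/wiki/Team_A" title="Team A"><img alt="Team A logo"></a></td>
  </tr>
</table>
"""


def cell_value(cell):
    """Return the cell's visible text, or a link title when there is no text."""
    # U+2060 (word joiner) is invisible and is not removed by strip().
    text = cell.get_text(strip=True).replace("\u2060", "")
    if text:
        return text
    link = cell.find("a", title=True)
    return link["title"] if link else None


def table_to_frame(html):
    """Build a DataFrame from the first table.wikitable in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select_one("table.wikitable").find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [[cell_value(td) for td in row.find_all("td")] for row in rows[1:]]
    return pd.DataFrame(records, columns=header)


matches = table_to_frame(SAMPLE_HTML)
print(matches)
```

On the real page, the same idea would apply to the HTML returned by requests, once the actual attribute holding the team name has been confirmed.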

EDIT: I made some progress

import requests
from bs4 import BeautifulSoup
import pandas as pd

cblol_url = "https://lol.fandom.com/wiki/CBLOL/2022_Season/Split_2"
data = requests.get(cblol_url)
# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(data.text, "html.parser")

cblol_table = soup.select('#md-table')
cblol_table = pd.read_html(data.text, match="Score")[0]
cblol_table.columns = cblol_table.columns.droplevel(0)

Here I checked .dtypes, and the "Score" column is an object because its values look like "1 - 0". Apparently, I need this value to be a float or a string.

So I tried splitting the column with pandas, and I end up getting a giant error that ends with "KeyError: 'Score'".

cblol_table['Score'].str.split(' - ', expand=True)

How do I proceed from here?
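A KeyError on 'Score' generally means that no column has exactly that name at the moment of the lookup, for instance because the header still has several levels or the name carries extra whitespace. Printing the column names before indexing usually shows what is actually there. The sketch below uses made-up data with a two-level header (the team names, the "Week 1" level, and the new "Score 1"/"Score 2" column names are all hypothetical), so the real table may need a different level or name:

```python
import pandas as pd

# Hypothetical data standing in for the scraped table: a two-level header
# with a "Score" column holding strings like "1 - 0".
columns = pd.MultiIndex.from_tuples(
    [("Week 1", "Team 1"), ("Week 1", "Score"), ("Week 1", "Team 2")]
)
table = pd.DataFrame(
    [["Team A", "1 - 0", "Team B"], ["Team C", "0 - 1", "Team D"]],
    columns=columns,
)

# Keep only the innermost header level, trim whitespace, and inspect the
# resulting names before using them as keys.
table.columns = table.columns.get_level_values(-1).str.strip()
print(list(table.columns))

# Split "1 - 0" into two integer columns.
scores = table["Score"].str.split(" - ", expand=True).astype(int)
table["Score 1"] = scores[0]
table["Score 2"] = scores[1]
print(table)
```

If the real header has more than two levels, a single droplevel(0) leaves a MultiIndex behind, and a plain string key can then fail in the same way.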