I am trying to scrape data from https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps to build a dataframe with all the players' names and their stats (overall rating, position, pac, sho, pas, dri, def, phy). However, rvest does not detect the information as a table.
I tried:
for(i in 1:10) {
page <- read_html(paste("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps",sep=""))
}
StatsTable <- page %>%
html_table(fill=TRUE)
head(StatsTable)
This prints an empty list() instead of a table. How can I edit my for loop so that read_html and html_table pick up the data on the website and I can create a dataframe with the player stats?
I also tried it for the first page like this:
first <- read_html("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps",sep="")
first
tab <- first %>%
html_nodes(".padding-0") %>%
html_text()
tab
### Deletes spaces and \n
tab <- gsub(" ", "", tab)
tab <- gsub("\n", " ", tab)
tab
This way I got all the data from the first page, but everything ends up in character strings. Is it possible to extract the names and stats from these strings and turn them into a dataframe? How could that be done?
CodePudding user response:
I updated the code so that it scrapes the first ten sub-pages into one dataframe in a single run. Please note that the scraping code comes from @Otto_Kässi's answer, so all the credit should go to him!
library(rvest)
library(stringr)
library(tidyverse)
url <- "https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps"
p1 <- str_c("https://www.futhead.com/22/players/",'?page=', 1:10)
pages <- paste0(p1,"&level=gold_nif&bin_platform=ps")
df <- tibble(player = character(),
overall= character(),
pac = character(),
sho = character(),
pas = character(),
dri = character(),
def = character(),
phy = character())
for (i in pages) {
i %>% read_html() %>%
html_nodes("[class='list-group list-group-table player-group-table']") %>%
html_nodes("[class='player-info']") %>% html_nodes("[class='player-image']") %>%
html_attr("alt") -> player_names
i %>% read_html() %>%
html_nodes("[class='player-right text-center hidden-xs']") %>%
html_nodes("[class='value']") %>%
html_text() %>%
matrix(nrow=length(player_names), ncol=6, byrow=T) -> player_stats
player_names %>% as_tibble() -> player_names
names(player_names) <- 'player'
substr(player_names$player, str_length(player_names$player)-1, str_length(player_names$player)) -> overall
player_names$overall <- overall
as_tibble(player_stats) -> player_stats
names(player_stats) <- c('pac','sho','pas','dri','def','phy')
#bind everything together
bind_cols(player_names, player_stats) -> players
df <- rbind(df, players)
rm(player_names); rm(player_stats); rm(players)
}
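# strip the rating from the name, trim the leftover space, and convert all eight stat columns (2:8) to numeric
df <- df %>% mutate(player = str_trim(str_replace_all(player, "[:digit:]", ""))) %>% mutate_at(vars(2:8), as.numeric)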
If you run the whole code at once, it should work!
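The clean-up step at the end of the loop can be tried offline on a few made-up rows. This is only a sketch: the player names and values are invented, and it assumes dplyr 1.0 or later for across(), which is used here in place of mutate_at().

```r
library(dplyr)
library(stringr)

# Made-up rows shaped like the scraped data: every column starts as character.
players <- tibble(
  player  = c("Player A 90", "Player B 88"),
  overall = c("90", "88"),
  pac     = c("80", "75"),
  phy     = c("70", "82")
)

cleaned <- players %>%
  # Drop the trailing rating from the name, then trim the leftover space.
  mutate(player = str_trim(str_remove(player, "[0-9]+$"))) %>%
  # Convert every column except the name to numeric.
  mutate(across(-player, as.numeric))

cleaned
```

Selecting columns by name (-player) rather than by position avoids accidentally leaving the last stat column as character.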
CodePudding user response:
I do not think you can accomplish what you want using html_table. The table on the page you are trying to scrape is not an HTML table element.
You will notice that the thing that looks like a table is actually a <ul> element. You will then need to pick up the info you want using separate html_nodes() calls, e.g.
page %>%
html_nodes("[class='list-group list-group-table player-group-table']") %>%
html_nodes("[class='player-info']") %>% html_nodes("[class='player-image']") %>%
html_attr("alt") -> player_names
and
page %>%
html_nodes("[class='player-right text-center hidden-xs']") %>%
html_nodes("[class='value']") %>%
html_text() %>%
matrix(nrow=length(player_names), ncol=6, byrow=T) -> player_stats
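To see the difference without hitting the website, here is a small offline sketch. The HTML fragment is hypothetical: it only borrows the class names used in the selectors above, and the real page's markup may well differ.

```r
library(rvest)

# Hypothetical fragment mimicking a <ul>-based "table" (not the real page).
html <- read_html('
  <ul class="list-group list-group-table player-group-table">
    <li>
      <span class="player-info"><img class="player-image" alt="Player A 90"></span>
      <span class="player-right text-center hidden-xs"><span class="value">80</span></span>
      <span class="player-right text-center hidden-xs"><span class="value">70</span></span>
    </li>
  </ul>')

# There is no <table> element, so html_table() has nothing to return.
tables <- html %>% html_table()
length(tables)

# Attribute selectors on the class values still reach the data.
alt_text <- html %>%
  html_nodes("[class='player-image']") %>%
  html_attr("alt")

values <- html %>%
  html_nodes("[class='player-right text-center hidden-xs']") %>%
  html_nodes("[class='value']") %>%
  html_text()

alt_text
values
```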
One way to capture player positions is to use gsub() to find the string between <strong> and </strong> in the player-club-league-name class.
page %>%
html_nodes("[class='list-group list-group-table player-group-table']") %>%
html_nodes("[class='player-club-league-name']") %>%
gsub(".*<strong>(.+)</strong>.*", "\\1", .) -> positions
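The capture-group idea can be checked on a plain string first. The markup below is a made-up example of what such a node might look like once converted to text; the real node's HTML may differ.

```r
# Hypothetical markup string (not copied from the real page).
node_html <- '<span class="player-club-league-name"><strong>ST</strong> | Club | League</span>'

# Replace the whole string with the capture group,
# i.e. the text between <strong> and </strong>.
position <- gsub(".*<strong>(.+)</strong>.*", "\\1", node_html)
position
```

If a node contained no <strong> tag, the pattern would not match and gsub() would return the input unchanged, so it is worth inspecting the result before binding it to the other columns.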
Finally, make everything into a data.frame:
# make player_names into a tibble and extract overall score
library(tidyverse)
player_names %>% as_tibble() -> player_names
names(player_names) <- 'player'
substr(player_names$player, str_length(player_names$player)-1, str_length(player_names$player)) -> overall
player_names$overall <- overall
# stat names for player_stats
as_tibble(player_stats) -> player_stats
names(player_stats) <- c('pac','sho','pas','dri','def','phy')
#bind everything together
bind_cols(player_names, player_stats) -> players
rm(player_names); rm(player_stats)
Result:
> players
# A tibble: 48 x 8
   player                          overall pac   sho   pas   dri   def   phy
   <chr>                           <chr>   <chr> <chr> <chr> <chr> <chr> <chr>
 1 Lionel Messi 93                 93      85    92    91    95    34    65
 2 Robert Lewandowski 92           92      78    92    79    86    44    82
 3 C. Ronaldo dos Santos Aveiro 91 91      87    93    82    88    34    75
 4 Kevin De Bruyne 91              91      76    86    93    88    64    78
 5 Neymar da Silva Santos Jr. 91   91      91    83    86    94    37    63
 6 Kylian Mbappé 91                91      97    88    80    92    36    77
 7 Harry Kane 90                   90      70    91    83    83    47    83
 8 N'Golo Kanté 90                 90      78    66    75    82    87    83
 9 Mohamed Salah 89                89      90    87    81    90    45    75
10 Karim Benzema 89                89      76    86    81    87    39    77
# … with 38 more rows