Web scraping with rvest not working, tables not detected

I am trying to scrape data from https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps to build a data frame with all the players' names and their stats (overall rating, position, pac, sho, pas, dri, def, phy). However, rvest does not detect the information as a table.

I tried:

for(i in 1:10) {
  page <- read_html(paste("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps",sep=""))
}

StatsTable <- page %>%
  html_table(fill=TRUE)
head(StatsTable)

This prints an empty list() instead of a table. How can I edit my for loop so that read_html and html_table detect the data on the website and I can create a data frame with the player stats?

I also tried it for the first page like this:

first <- read_html("https://www.futhead.com/22/players/?page=1&level=gold_nif&bin_platform=ps",sep="")
first
tab <- first %>%
  html_nodes(".padding-0") %>%
  html_text()
tab
### Deletes spaces and \n
tab <- gsub("  ", "", tab)
tab <- gsub("\n", " ", tab)
tab

This way I got all the data from the first page, but everything ends up in character strings. Is it possible to extract the names and stats from these strings and turn them into a data frame? How could that be done?

CodePudding user response：

I updated the code so that it scrapes the first ten sub-pages into one data frame in a single run. Please note that the scraping code comes from @Otto_Kässi's answer, so all the credit should go to him!

library(rvest)
library(tidyverse)  # also attaches stringr and dplyr

# Build the URLs for the first ten result pages
pages <- str_c("https://www.futhead.com/22/players/?page=", 1:10,
               "&level=gold_nif&bin_platform=ps")

df <- tibble(player = character(),
             overall= character(),
             pac = character(),
             sho = character(),
             pas = character(),
             dri = character(),
             def = character(),
             phy = character())

for (page_url in pages) {
  # Download and parse each page only once
  page <- read_html(page_url)

  # The alt text of each player image holds the name followed by the overall rating
  player_names <- page %>%
    html_nodes("[class='list-group list-group-table player-group-table']") %>%
    html_nodes("[class='player-info']") %>%
    html_nodes("[class='player-image']") %>%
    html_attr("alt")

  # Six stat values per player, filled row by row
  player_stats <- page %>%
    html_nodes("[class='player-right text-center hidden-xs']") %>%
    html_nodes("[class='value']") %>%
    html_text() %>%
    matrix(nrow = length(player_names), ncol = 6, byrow = TRUE)

  player_names <- tibble(player = player_names)
  # The overall rating is the last two characters of the alt text
  player_names$overall <- str_sub(player_names$player, -2, -1)

  colnames(player_stats) <- c('pac','sho','pas','dri','def','phy')
  player_stats <- as_tibble(player_stats)

  # Bind everything together and append to the running result
  players <- bind_cols(player_names, player_stats)
  df <- bind_rows(df, players)
}

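# Strip the rating from the name and convert all stat columns (including phy) to numeric
df <- df %>%
  mutate(player = str_trim(str_remove_all(player, "[:digit:]"))) %>%
  mutate(across(overall:phy, as.numeric))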

If you run the whole code at once, it should work!

CodePudding user response：

I do not think you can accomplish what you want using html_table(). The table on the page you are trying to scrape is not an HTML <table> element.
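
To see why html_table() comes back empty, here is a minimal sketch on a small inline HTML snippet. The markup is made up for illustration (only the outer class name is taken from the selectors used in this thread), so treat it as an assumption about the page's structure rather than a copy of it.

```r
library(rvest)

# Hypothetical markup: a list styled to look like a table
html <- '<ul class="list-group list-group-table player-group-table">
  <li><span class="player-name">Player A</span><span class="value">85</span></li>
  <li><span class="player-name">Player B</span><span class="value">78</span></li>
</ul>'

page <- read_html(html)

# html_table() only collects <table> elements, so it finds nothing here
tables <- html_table(page)
length(tables)

# Selecting the nodes directly does reach the data
values <- page %>% html_nodes(".value") %>% html_text()
values
```

The same check on the real page (counting the nodes returned by html_nodes("table")) should tell you quickly whether html_table() has anything to work with.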

You will notice that the thing that looks like a table is actually a <ul>. You will therefore need to pick out the information you want with separate html_nodes() calls. For example:

page %>% 
  html_nodes("[class='list-group list-group-table player-group-table']") %>% 
  html_nodes("[class='player-info']") %>% 
  html_nodes("[class='player-image']") %>% 
  html_attr("alt") -> player_names

and

page %>% 
  html_nodes("[class='player-right text-center hidden-xs']") %>% 
  html_nodes("[class='value']") %>% 
  html_text() %>% 
  matrix(nrow = length(player_names), ncol = 6, byrow = TRUE) -> player_stats
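
As an illustration of the reshaping step, the sketch below shows how matrix(..., byrow = TRUE) turns a flat vector of stat values into one row per player. The values are the first two rows of the result shown further down; the name stat_values is only used for this example.

```r
# Flat vector in the form html_text() returns it: six values per player, in page order
stat_values <- c("85", "92", "91", "95", "34", "65",
                 "78", "92", "79", "86", "44", "82")

# byrow = TRUE fills one row per player instead of filling column by column
player_stats <- matrix(stat_values, ncol = 6, byrow = TRUE)
colnames(player_stats) <- c("pac", "sho", "pas", "dri", "def", "phy")
player_stats
```

Without byrow = TRUE the values would be filled down the columns, and each player's stats would be scattered across several rows.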

One way to capture player positions is to use gsub() to find the string between <strong> and </strong> from the player-club-league-name class.

page %>% 
  html_nodes("[class='list-group list-group-table player-group-table']") %>% 
  html_nodes("[class='player-club-league-name']") %>% 
  gsub(".*<strong>(.+)</strong>.*", "\\1", .) -> positions
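
A possible alternative to the gsub() approach, assuming the position really is the only <strong> element inside each player-club-league-name node, is to select that element directly and read its text. The markup below is hypothetical and only modelled on the description above; the real page may differ.

```r
library(rvest)

# Hypothetical markup: position in <strong>, followed by other text
html <- '<ul class="list-group list-group-table player-group-table">
  <li><span class="player-club-league-name"><strong>RW</strong> Club A</span></li>
  <li><span class="player-club-league-name"><strong>ST</strong> Club B</span></li>
</ul>'

# Select the <strong> children directly instead of parsing the raw HTML string
positions <- read_html(html) %>%
  html_nodes(".player-club-league-name strong") %>%
  html_text()
positions
```

This avoids running a regular expression over raw HTML, but it returns more than one value per player if a node contains several <strong> elements.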

Finally, make everything into a data.frame:

# make player_names into a tibble and extract overall score
library(tidyverse)
player_names %>% as_tibble() -> player_names 
names(player_names) <- 'player'
substr(player_names$player, str_length(player_names$player)-1, str_length(player_names$player)) -> overall
player_names$overall <- overall


# stat names for player_stats
as_tibble(player_stats) -> player_stats
names(player_stats) <- c('pac','sho','pas','dri','def','phy')

#bind everything together
bind_cols(player_names, player_stats) -> players
rm(player_names); rm(player_stats)

Result:

> players
# A tibble: 48 x 8
   player                          overall pac   sho   pas   dri   def   phy
   <chr>                           <chr>   <chr> <chr> <chr> <chr> <chr> <chr>
 1 Lionel Messi 93                 93      85    92    91    95    34    65
 2 Robert Lewandowski 92           92      78    92    79    86    44    82
 3 C. Ronaldo dos Santos Aveiro 91 91      87    93    82    88    34    75
 4 Kevin De Bruyne 91              91      76    86    93    88    64    78
 5 Neymar da Silva Santos Jr. 91   91      91    83    86    94    37    63
 6 Kylian Mbappé 91                91      97    88    80    92    36    77
 7 Harry Kane 90                   90      70    91    83    83    47    83
 8 N'Golo Kanté 90                 90      78    66    75    82    87    83
 9 Mohamed Salah 89                89      90    87    81    90    45    75
10 Karim Benzema 89                89      76    86    81    87    39    77
# … with 38 more rows
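
The stat columns in this result are still character vectors, and the player column still carries the rating. Here is a minimal sketch of cleaning that up, assuming dplyr 1.0 or later (for across()); players_clean is a name chosen for this example, and the two rows are copied from the result above.

```r
library(dplyr)
library(stringr)

# Two rows copied from the result above
players <- tibble(
  player  = c("Lionel Messi 93", "Robert Lewandowski 92"),
  overall = c("93", "92"),
  pac = c("85", "78"), sho = c("92", "92"), pas = c("91", "79"),
  dri = c("95", "86"), def = c("34", "44"), phy = c("65", "82")
)

players_clean <- players %>%
  # Drop the trailing rating (whitespace plus digits at the end of the name)
  mutate(player = str_remove(player, "\\s*\\d+$")) %>%
  # Convert every column from overall through phy to numeric
  mutate(across(overall:phy, as.numeric))
players_clean
```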