Web scraping with R (rvest)


Hi guys! I hope you're all fine! I'm new to R and am having some trouble creating a good web scraper with R... I started studying this language only 5 days ago, so I'll appreciate any help! ^^

Idea

I'm trying to scrape the classification tables of the "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia, so that I can group the teams later and analyze some stuff.

Explanation and problem

I'm scraping the page of the 2002 championship. I read the HTML page to extract the HTML nodes that I selected with the "SelectorGadget" extension in Google Chrome. There are some considerations:

  1. The page that I'm accessing is the one for the 2002 championship. I did that because it was easier to extract the links to the seasons' tables, which appear in a board at the end of that page, using a single selector for all of them (tr:nth-child(9) div a) and reading each link from the HTML attribute "href";
  2. The CSS selector I used was taken from the 2003 championship page.

So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and save them in a data frame!". However, it went wrong and I don't understand why... When I tried to run the "tabela_geral" line, the following error was returned: "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think it is reading a string instead of XML. What am I misunderstanding here? Where is my error? The "sapply" call? Thanks in advance!

The code

library("dplyr")
library("rvest")

link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)

links_temporadas <- pagina_wikipedia %>%
  html_nodes("tr:nth-child(9) div a") %>%
  html_attr("href") %>%
  paste("https://pt.wikipedia.org", ., sep = "")


tabela <- function(link){
  pagina_tabela <- read_html(link)
  
  tabela_wiki = link %>%   # note: this pipes the URL string, not pagina_tabela
    html_nodes("table.wikitable") %>%
    html_table() %>%
    paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)

CodePudding user response:

You can get all the tables from those links by doing this:

tabela <- function(link){
  read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}

all_tables = lapply(links_temporadas, tabela)
names(all_tables) <- 2003:2022

This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that element's link in links_temporadas). Note that the number of tables available at each link varies.

lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 
   6    5   10    9   10   12   11   10   12   11   13   14   17   16   16   16   16   15   17    7 

You will need to determine which table(s) you are interested in from each of these years.
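For example, to pick out the classification table from one year's element, you can filter by column name. A minimal sketch on a toy list standing in for one element of all_tables (the tables and team names here are made up for illustration):

```r
# Toy stand-in for one year's list of scraped tables: only the
# second one is the classification table (it has a "Pos." column).
tabelas_ano <- list(
  data.frame(Equipe = c("A", "B"), Artilheiro = c("X", "Y")),
  data.frame(Pos. = 1:2, Equipes = c("Cruzeiro", "Santos"), P = c(100, 87),
             check.names = FALSE),
  data.frame(Rodada = 1:2, Resultado = c("1x0", "2x2"))
)

# Logical index of tables whose columns include "Pos."
e_classificacao <- sapply(tabelas_ano, function(x) "Pos." %in% names(x))
classificacao <- tabelas_ano[[which(e_classificacao)[1]]]
classificacao$Equipes
#> [1] "Cruzeiro" "Santos"
```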

CodePudding user response:

Here is a way. It is more complicated than your function because those pages have more than one table, so the function keeps only the first table whose column names include "Pos.".

Then, before rbinding the tables, keep only the common columns, since the older tables have one column fewer (they lack column "M").

suppressPackageStartupMessages({
  library("dplyr")
  library("rvest")
})

link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)

links_temporadas <- pagina_wikipedia %>%
  html_nodes("tr:nth-child(9) div a") %>%
  html_attr("href") %>%
  paste("https://pt.wikipedia.org", ., sep = "")


tabela <- function(link){
  pagina_tabela <- read_html(link)

  lista_wiki <- pagina_tabela %>%
    html_elements("table.wikitable") %>%
    html_table()

  i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
  i <- which(i)[1]
  lista_wiki[[i]]
}

tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)

sapply(tabela_geral, ncol)
#>  [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)

common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)

tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#>    Pos. Equipes       P         J     V     E     D    GP    GC SG      `%`
#>   <int> <chr>         <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1     1 Cruzeiro      100      46    31     7     8   102    47  55      72
#> 2     2 Santos        87       46    25    12     9    93    60  33      63
#> 3     3 São Paulo     78       46    22    12    12    81    67  14      56
#> 4     4 São Caetano   742      46    19    14    13    53    37  16      53
#> 5     5 Coritiba      73       46    21    10    15    67    58  9       52
#> 6     6 Internacional 721      46    20    10    16    59    57  2       52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>

Created on 2022-04-03 by the reprex package (v2.0.1)
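The common-column step can be illustrated on two toy data frames (hypothetical values; the newer one carries the extra "M" column):

```r
# Older-style table: no "M" column
antiga <- data.frame(Pos. = 1:2, Equipes = c("A", "B"), P = c(10, 8),
                     check.names = FALSE)
# Newer-style table: one extra column "M"
nova <- data.frame(Pos. = 1:2, Equipes = c("C", "D"), P = c(9, 7), M = c(5, 4),
                   check.names = FALSE)

# Columns present in every table
common_names <- Reduce(intersect, lapply(list(antiga, nova), names))
common_names
#> [1] "Pos."    "Equipes" "P"

# Drop the extra columns, then stack the tables
reduzida <- lapply(list(antiga, nova), `[`, common_names)
do.call(rbind, reduzida)
```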

To keep all columns, including the "M" column:

data.table::rbindlist(tabela_geral, fill = TRUE)
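A small self-contained illustration of the fill = TRUE behaviour (toy data frames; requires the data.table package):

```r
library(data.table)

antiga <- data.frame(Pos. = 1:2, P = c(10, 8), check.names = FALSE)
nova   <- data.frame(Pos. = 1:2, P = c(9, 7), M = c(5, 4), check.names = FALSE)

# fill = TRUE keeps the union of the columns; rows from "antiga",
# which lacks "M", are padded with NA in that column
tudo <- rbindlist(list(antiga, nova), fill = TRUE)
```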