how to change a column in a nested data frame to integer - R

Time:09-26

I'm scraping a website and while trying to merge the data, I get this error:

Error in `f()`:
! Can't join on `x$Grand Slam Pts` x `y$Grand Slam Pts`
  because of incompatible types.
ℹ `x$Grand Slam Pts` is of type <character>>.
ℹ `y$Grand Slam Pts` is of type <integer>>.
Run `rlang::last_error()` to see where the error occurred.

As the error says, the 'Grand Slam Pts' column is character in some tables and integer in others. All the tables are stored in a list. How do I convert the 'Grand Slam Pts' column to integer in every table?

library(rvest); library(tidyverse); library(stringr)

my_links = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/links_for_table400-415.csv')

my_links = unname(unlist(my_links))

table_list <- list()

for (i in 1:16) {
  
  page <- read_html(my_links[i])
  
  the_table <- page %>% html_table() %>% .[[1]] %>% 
    filter(X1 != "")
  
  colnames(the_table) <- page %>% html_element(xpath = "//thead") %>% 
    html_text() %>% 
    str_replace_all("\n ", "\n") %>% 
    str_split("\n") %>% 
    unlist() %>% 
    .[.!=""]
  
  table_list[[i]] <- the_table

} 

final_table <- Reduce(full_join,table_list)

CodePudding user response:

You can do two things:

  1. Ensure the correct column format upstream (i.e. in your for loop when scraping & parsing data), or
  2. Fix things downstream (i.e. after scraping & parsing data).
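
For reference, option 2 could look roughly like the sketch below. The two small tables are made-up stand-ins for the scraped ones (the `Player` column and all values are hypothetical), and it assumes "-" is the only non-numeric entry in the column.

```r
library(dplyr)

# Hypothetical stand-ins for two scraped tables with mismatched column types
table_list <- list(
  tibble(Player = c("A", "B"), `Grand Slam Pts` = c("2000", "-")),
  tibble(Player = c("C", "D"), `Grand Slam Pts` = c(1200L, 360L))
)

# Coerce the column to integer in every table; "-" becomes NA
fixed_list <- lapply(table_list, function(tbl) {
  mutate(tbl, `Grand Slam Pts` = suppressWarnings(as.integer(`Grand Slam Pts`)))
})

# The join no longer fails because the key columns now share a type
final_table <- Reduce(
  function(x, y) full_join(x, y, by = c("Player", "Grand Slam Pts")),
  fixed_list
)
```

This leaves the scraping loop untouched, at the cost of a second pass over the list before joining.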

I recommend option 1, since it's generally better to fix things upstream when possible. The two lines I added to the pipeline are marked with in-line comments.

for (i in 1:16) {
    
    page <- read_html(my_links[i])
    
    the_table <- page %>% html_table() %>% .[[1]] %>% 
        filter(X1 != "") %>%
        # Replace "-" with NA (character columns only, since na_if() needs matching types)
        mutate(across(where(is.character), ~ na_if(.x, "-"))) %>%
        # Use `readr::parse_guess` to guess column format for char cols
        mutate(across(where(is.character), parse_guess))
    
    colnames(the_table) <- page %>% html_element(xpath = "//thead") %>% 
        html_text() %>% 
        str_replace_all("\n ", "\n") %>% 
        str_split("\n") %>% 
        unlist() %>% 
        .[.!=""]
    
    table_list[[i]] <- the_table
    
} 

Note: I'm using readr::parse_guess() here instead of as.integer() to be more robust to other unexpected values in the character columns. If "-" is the only problem, as.integer() may be enough.
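
To illustrate the difference, here is a small sketch on a made-up vector (the values are hypothetical, not taken from the scraped pages). As far as I know, recent readr versions return double rather than integer for whole numbers unless `guess_integer = TRUE` is set, so treat the exact result type as an assumption to check on your installation.

```r
library(readr)

# Hypothetical scraped values: "-" marks a missing entry
pts <- c("2000", "1200", "-", "360")

# as.integer() turns "-" into NA, but only with a coercion warning
as_int <- suppressWarnings(as.integer(pts))

# parse_guess() takes an `na` argument, so "-" can be declared missing up front
guessed <- parse_guess(pts, na = c("", "NA", "-"))
```

Passing `na = "-"` this way could also replace the separate na_if() step in the loop.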
