Mutating several columns of many dataframes with a for loop or apply

I'm trying to find a loop or apply-family solution for the following problem. I have a few dataframes such as:

df1 <- data.frame(a = c(1,2,3,NA,NA,NA,NA,NA,9,NA),b = c(1,2,3,4,NA,NA,NA,8,9,10),c = c(1,2,3,NA,NA,NA,7,8,NA,NA))

df2 <- data.frame(a = c(1,2,3,4,5,6,NA,NA,NA,10),b = c(1,2,3,4,NA,NA,NA,8,9,10),c = c(1,2,3,NA,NA,NA,7,8,NA,NA))

df5 <- data.frame(a = c(1,2,3,4,5,6,NA,NA,9,10),b = c(1,2,3,4,5,6,NA,8,9,10),c = c(1,2,3,NA,NA,NA,7,8,9,NA))

I'm trying to use na.approx (from the zoo package) to fill in some of the NA gaps. What I had in mind is:

l <- c(1,2,5)
for (i in l){
    df[[i]] <- df[[i]] %>% mutate(a = na.approx(a, na.rm = FALSE))
    df[[i]] <- df[[i]] %>% mutate(b = na.approx(b, na.rm = FALSE))
    df[[i]] <- df[[i]] %>% mutate(c = na.approx(c, na.rm = FALSE))
}

With this example I'm getting the following error:

Error in UseMethod("mutate") : 
no applicable method for 'mutate' applied to an object of class "c('double', 'numeric')"

With my actual data I'm getting this error instead:

Error in `vectbl_as_col_location2()`:
! Can't extract columns past the end.
i Location 13101 doesn't exist.
i There are only 16 columns.

Here "13101" is the numeric suffix of one of my dataframes, which is named "df13101".

When I check the class of the dataframes, I get

[1] "data.frame"

for the example, but for my actual dataframes I get

[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

When I check the type of each variable I want to mutate, all of them are numeric (in both the example and the real data).

I need to understand how to refer to these dataframes properly, and what problems I could run into because of the data class or the use of mutate. I've tried using mapply, but I'm very new to R and I'm only just learning about the apply family.

Any help would be great, thanks for reading!

CodePudding user response:

It is easier to do this if the dataframes are stored in a list. You can then apply the function to each numeric column.

library(dplyr)
library(zoo)

l <- c(1,2,5)
list_of_data <- mget(paste0('df', l))

list_of_data <- purrr::map(list_of_data, ~.x %>%
                      mutate(across(where(is.numeric), 
                       ~na.approx(.x, na.rm = FALSE))))

list_of_data
#$df1
#    a  b  c
#1   1  1  1
#2   2  2  2
#3   3  3  3
#4   4  4  4
#5   5  5  5
#6   6  6  6
#7   7  7  7
#8   8  8  8
#9   9  9 NA
#10 NA 10 NA

#$df2
#    a  b  c
#1   1  1  1
#2   2  2  2
#3   3  3  3
#4   4  4  4
#...
#...

If you want the new values written back to the original dataframes, use list2env.

list2env(list_of_data, .GlobalEnv)
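Since the question mentions the apply family, the same idea can be sketched in base R with lapply instead of purrr and across. This is only an illustration under assumptions: the two small data frames and the helper name fill_numeric are made up for the example, and it assumes the zoo package is installed.

```r
library(zoo)

# Small made-up data frames, kept in a named list
dfs <- list(
  df1 = data.frame(a = c(1, NA, 3), b = c(2, 4, NA)),
  df2 = data.frame(a = c(NA, 5, 6), b = c(1, NA, 3))
)

# Hypothetical helper: interpolate every numeric column of one data frame
fill_numeric <- function(d) {
  num <- vapply(d, is.numeric, logical(1))             # which columns are numeric
  d[num] <- lapply(d[num], na.approx, na.rm = FALSE)   # fill interior gaps only
  d
}

# Apply the helper to each data frame in the list
dfs <- lapply(dfs, fill_numeric)
dfs$df1
```

With na.rm = FALSE, leading and trailing NA values are left as NA, so each column keeps its original length.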

CodePudding user response:

The code in the question has these problems:

df[[1]] is not the same as df1. df[[1]] extracts the first column of an object named df (which is not one of your data frames), whereas df1 is the data frame you actually want. Instead, if e is the environment where df1, etc. are located, then we can refer to df1 by name as e[["df1"]], using the string "df1".
There is no point in applying na.approx separately to each column since na.approx can handle an entire numeric data frame at once.
This may or may not be a problem for you but note that the code overwrites df1, etc. so if you want to test it again after running it then it will be necessary to recreate the original df1, etc. You may wish to use lists as shown in the second approach below instead.
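The first point can be checked with a small base R sketch. The two data frames here are made up for illustration, and e is set with environment() so the snippet works wherever it is run.

```r
df1 <- data.frame(a = c(1, 2, 3))
df5 <- data.frame(a = c(4, 5, 6))

# Indexing a data frame with [[1]] returns its first column, a plain vector
first_col <- df1[[1]]
class(first_col)             # "numeric", so mutate() has no method for it

# Building the name as a string and looking it up returns the whole data frame
e <- environment()           # the environment this code runs in
nm <- paste0("df", 5)        # "df5"
looked_up <- e[[nm]]
class(looked_up)             # "data.frame"
identical(looked_up, df5)    # TRUE
```

This is also why the real data fails with "Can't extract columns past the end": a numeric index such as 13101 is treated as a column position, not as part of a name.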

Below we assume that the input data frames are in the global environment, i.e. sitting in your workspace. (Replace the e <- ... line with e <- environment() if the data frames are in the current, rather than global, environment. If the data frames were defined and located only within a function and they are being referenced within the same function that would be the case.)

e[[nm]] refers to the object whose name in environment e is given by the value of the character string held in the nm variable. We then apply na.approx to that and assign it back. Note that na.approx returns a matrix when applied to a data.frame so we use [] on the left hand side to insert the values from the matrix into the data frame.

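library(zoo)

l <- c(1, 2, 5)          # suffixes of the data frame names, as in the question
e <- .GlobalEnv          # environment that holds df1, df2, df5
nms <- paste0("df", l)   # "df1" "df2" "df5"
# the empty [] keeps each object a data frame while replacing its contents
for (nm in nms) e[[nm]][] <- na.approx(e[[nm]], na.rm = FALSE)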

Alternatively, put the data frames in a named list L:

L <- mget(nms) # nms defined above
for (nm in nms) L[[nm]][] <- na.approx(L[[nm]], na.rm = FALSE)
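The effect of the empty [] on the left-hand side can be seen in isolation with base R alone. In this sketch the matrix m is a made-up stand-in for the matrix that na.approx returns.

```r
d <- data.frame(a = c(1, 2), b = c(3, 4))
m <- as.matrix(d) * 10   # stand-in for a matrix result such as na.approx(d)

# Plain assignment would replace the data frame with the matrix
plain <- m
class(plain)[1]          # "matrix"

# Assigning into d[] copies the values in and keeps the data frame
d[] <- m
class(d)                 # "data.frame"
d$a                      # 10 20
```

Column names and the data.frame class are preserved because only the contents are replaced.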