avoid creating duplicate variable names in function-CodePudding

I have a dataframe and a function that creates a new variable, adds it to the dataframe and then assigns the dataframe back to the global environment. The problem is that if I rerun the function it creates a duplicate of the variable.

library(tidyverse)
library(rms)
set.seed(10)
ds <- data.frame(
  ftime = rexp(200),
  fstatus = sample(0:1,200, replace = TRUE),
  x1 = runif(200),
  x2 = runif(200),
  x3 = factor(sample(LETTERS[1:3], size = 200, replace = TRUE)))
ds
#model
s <- Surv(ds$ftime, ds$fstatus == 1) 
fit <- cph(s ~ x1   x2   x3, data = ds, surv = TRUE, x = TRUE, y = TRUE)

#function to add prediction to dataset
pred_fun <- function(time_to_sur, model) {
  
  pred_data <- ds[, c("x1", "x2", "x3")] %>% 
    mutate(ftime = time_to_sur,
           fstatus = 1) %>%   
    as.data.frame()
  
  ds$pred_var_tmp <-
    rms::survest(model, times = time_to_sur,
                 newdata = pred_data,
                 se.fit = FALSE, what = "survival")$surv
  
  #rename variable
  pred_var <- paste0("pred_prob_", as.character(time_to_sur), "_rms")
  names(ds)[names(ds) == "pred_var_tmp"] <- pred_var
  
  #assign dataset back to global environment
  assign("ds", ds, env = .GlobalEnv) 
}

The function works as it should:

pred_fun(time_to_sur = 0.2, fit)
names(ds)
# [1] "ftime"             "fstatus"           "x1"               
# [4] "x2"                "x3"                "pred_prob_0.2_rms"

But if I rerun it again, it creates a duplicate of the variable

pred_fun(time_to_sur = 0.2, fit)
names(ds)
# [1] "ftime"             "fstatus"           "x1"               
# [4] "x2"                "x3"                "pred_prob_0.2_rms"
# [7] "pred_prob_0.2_rms"

This is to be expected because the function create a new variable first with a different name and then assigns the name after. I thought the following might work in the function but it doesn't:

ds$eval(substitute(paste0("pred_prob_", as.character(tt), "_rms"))) <-
    rms::survest(model, times = time_to_sur,
                 newdata = pred_data,
                 se.fit = FALSE, what = "survival")$surv

How can I fix this and what is best practices in this situation?

Thanks

CodePudding user response：

1) Base R This will overwrite the existing column if it already exists. This overwrites Time, which is originally c(1, 2, 3, 4, 5, 7) with 11:16.

newName <- "Time" # duplicated column name
values <- 11:16
replace(BOD, newName, values)
##   Time demand
## 1   11    8.3
## 2   12   10.3
## ...

If the new column name did not exist then it creates a new column.

newName <- "Time2" # new column name, not duplicate
values <- 11:16
replace(BOD, newName, values)
##   Time demand Time2
## 1    1    8.3    11
## 2    2   10.3    12
## ...

2) dplyr If you would like to use dplyr for this then:

library(dplyr)

newName <- "Time" # duplicated column name
values <- 11:16
mutate(BOD, {{newName}} := values)
##   Time demand
## 1   11    8.3
## 2   12   10.3
## ...

newName <- "Time2" # new column name, not duplicate
values <- 11:16
mutate(BOD, {{newName}} := values)
##   Time demand Time2
## 1    1    8.3    11
## 2    2   10.3    12
## ...

Other

Functional nature of R. R is a functional language and normally functions are written to pass the input in via the arguments and to pass the output as the return value. Here x is the input and y is the output.

# ok
f <- function(x) x   1
y <- f(3)
y
## [1] 4

It is better not to do this:

#  not good
f <- function() assign("y",  x   1, .GlobalEnv)
x <- 3
f()
y
## [1] 4

Replacement functions. Although not frequently used R does support replacement functions and syntax like this (see https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Subset-assignment). For example, names<- is a commonly used replacement function. We create a duplicate of BOD first so that we don't overwrite it. This preserves the input making it easier to debug.

`f<-` <- function(x, value) replace(x, "Time", value)
BOD2 <- BOD
f(BOD2) <- 11:16
BOD2
##   Time demand
## 1   11    8.3
## 2   12   10.3
## ...

CodePudding user response：

Thanks to @G. Grothendieck and @Limey, the following simplification works (pred_fun_final) although I do get a warning message.

#original function in OP
pred_fun_original <- function(time_to_sur, model) {
  
  pred_data <- ds[, c("x1", "x2", "x3")] %>% 
    mutate(ftime = time_to_sur,
           fstatus = 1) %>%   
    as.data.frame()
  
  ds$pred_var_tmp <-
    rms::survest(model, times = time_to_sur,
                 newdata = pred_data,
                 se.fit = FALSE, what = "survival")$surv
  
  #rename variable
  pred_var <- paste0("pred_prob_", as.character(time_to_sur), "_rms")
  names(ds)[names(ds) == "pred_var_tmp"] <- pred_var

  assign("ds", ds, env = .GlobalEnv) 
}
pred_fun_original(time_to_sur = 0.2, fit)

#save created variable
test1 <- ds$pred_prob_0.2_rms

#remove pred_prob_0.2_rms
ds <- ds %>% 
  select(-pred_prob_0.2_rms)

New function with warning:

#fixed function
pred_fun_final <- function(data, time_to_sur, model) {
  
  newName <- paste0("pred_prob_", as.character(time_to_sur), "_rms")
  pred_data <- data[, c("x1", "x2", "x3")] %>% 
    mutate(ftime = time_to_sur,
           fstatus = 1) %>%   
    as.data.frame()
  
  data <- data %>% 
    mutate({{newName}} := rms::survest(model, times = time_to_sur,
                                       newdata = pred_data,
                                       se.fit = FALSE, what = "survival")$surv)
  
  data
}
ds <- pred_fun_final(ds, time_to_sur = 0.2, fit)
# Warning message:
# Problem with `mutate()` column `pred_prob_0.2_rms`.
# i `pred_prob_0.2_rms = ...$NULL`.

#save variable
test2 <- ds$pred_prob_0.2_rms

The two variables are not identical but that is because one is named and the other is not (as.numeric() would fix this). It doesn't explain the warning message though.

identical(test1, test2)
#FALSE
str(test1)
#  num [1:200] 0.906 0.9 0.884 0.884 0.886 ...
str(test2)
#  Named num [1:200] 0.906 0.9 0.884 0.884 0.886 ...
#  - attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...