I have a dataframe
and a function
that creates a new variable, adds it to the dataframe and then assign
s the dataframe back to the global environment. The problem is that if I rerun the function it creates a duplicate of the variable.
library(tidyverse)
library(rms)
set.seed(10)
ds <- data.frame(
ftime = rexp(200),
fstatus = sample(0:1,200, replace = TRUE),
x1 = runif(200),
x2 = runif(200),
x3 = factor(sample(LETTERS[1:3], size = 200, replace = TRUE)))
ds
#model
s <- Surv(ds$ftime, ds$fstatus == 1)
fit <- cph(s ~ x1 x2 x3, data = ds, surv = TRUE, x = TRUE, y = TRUE)
#function to add prediction to dataset
pred_fun <- function(time_to_sur, model) {
pred_data <- ds[, c("x1", "x2", "x3")] %>%
mutate(ftime = time_to_sur,
fstatus = 1) %>%
as.data.frame()
ds$pred_var_tmp <-
rms::survest(model, times = time_to_sur,
newdata = pred_data,
se.fit = FALSE, what = "survival")$surv
#rename variable
pred_var <- paste0("pred_prob_", as.character(time_to_sur), "_rms")
names(ds)[names(ds) == "pred_var_tmp"] <- pred_var
#assign dataset back to global environment
assign("ds", ds, env = .GlobalEnv)
}
The function works as it should:
pred_fun(time_to_sur = 0.2, fit)
names(ds)
# [1] "ftime" "fstatus" "x1"
# [4] "x2" "x3" "pred_prob_0.2_rms"
But if I rerun it again, it creates a duplicate of the variable
pred_fun(time_to_sur = 0.2, fit)
names(ds)
# [1] "ftime" "fstatus" "x1"
# [4] "x2" "x3" "pred_prob_0.2_rms"
# [7] "pred_prob_0.2_rms"
This is to be expected because the function create a new variable first with a different name and then assigns the name after. I thought the following might work in the function but it doesn't:
ds$eval(substitute(paste0("pred_prob_", as.character(tt), "_rms"))) <-
rms::survest(model, times = time_to_sur,
newdata = pred_data,
se.fit = FALSE, what = "survival")$surv
How can I fix this and what is best practices in this situation?
Thanks
CodePudding user response:
1) Base R This will overwrite the existing column if it already exists. This overwrites Time, which is originally c(1, 2, 3, 4, 5, 7) with 11:16.
newName <- "Time" # duplicated column name
values <- 11:16
replace(BOD, newName, values)
## Time demand
## 1 11 8.3
## 2 12 10.3
## ...
If the new column name did not exist then it creates a new column.
newName <- "Time2" # new column name, not duplicate
values <- 11:16
replace(BOD, newName, values)
## Time demand Time2
## 1 1 8.3 11
## 2 2 10.3 12
## ...
2) dplyr If you would like to use dplyr for this then:
library(dplyr)
newName <- "Time" # duplicated column name
values <- 11:16
mutate(BOD, {{newName}} := values)
## Time demand
## 1 11 8.3
## 2 12 10.3
## ...
newName <- "Time2" # new column name, not duplicate
values <- 11:16
mutate(BOD, {{newName}} := values)
## Time demand Time2
## 1 1 8.3 11
## 2 2 10.3 12
## ...
Other
Functional nature of R. R is a functional language and normally functions are written to pass the input in via the arguments and to pass the output as the return value. Here x is the input and y is the output.
# ok
f <- function(x) x 1
y <- f(3)
y
## [1] 4
It is better not to do this:
# not good
f <- function() assign("y", x 1, .GlobalEnv)
x <- 3
f()
y
## [1] 4
Replacement functions. Although not frequently used R does support replacement functions and syntax like this (see https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Subset-assignment). For example, names<-
is a commonly used replacement function. We create a duplicate of BOD first so that we don't overwrite it. This preserves the input making it easier to debug.
`f<-` <- function(x, value) replace(x, "Time", value)
BOD2 <- BOD
f(BOD2) <- 11:16
BOD2
## Time demand
## 1 11 8.3
## 2 12 10.3
## ...
CodePudding user response:
Thanks to @G. Grothendieck and @Limey, the following simplification works (pred_fun_final
) although I do get a warning
message.
#original function in OP
pred_fun_original <- function(time_to_sur, model) {
pred_data <- ds[, c("x1", "x2", "x3")] %>%
mutate(ftime = time_to_sur,
fstatus = 1) %>%
as.data.frame()
ds$pred_var_tmp <-
rms::survest(model, times = time_to_sur,
newdata = pred_data,
se.fit = FALSE, what = "survival")$surv
#rename variable
pred_var <- paste0("pred_prob_", as.character(time_to_sur), "_rms")
names(ds)[names(ds) == "pred_var_tmp"] <- pred_var
assign("ds", ds, env = .GlobalEnv)
}
pred_fun_original(time_to_sur = 0.2, fit)
#save created variable
test1 <- ds$pred_prob_0.2_rms
#remove pred_prob_0.2_rms
ds <- ds %>%
select(-pred_prob_0.2_rms)
New function with warning
:
#fixed function
pred_fun_final <- function(data, time_to_sur, model) {
newName <- paste0("pred_prob_", as.character(time_to_sur), "_rms")
pred_data <- data[, c("x1", "x2", "x3")] %>%
mutate(ftime = time_to_sur,
fstatus = 1) %>%
as.data.frame()
data <- data %>%
mutate({{newName}} := rms::survest(model, times = time_to_sur,
newdata = pred_data,
se.fit = FALSE, what = "survival")$surv)
data
}
ds <- pred_fun_final(ds, time_to_sur = 0.2, fit)
# Warning message:
# Problem with `mutate()` column `pred_prob_0.2_rms`.
# i `pred_prob_0.2_rms = ...$NULL`.
#save variable
test2 <- ds$pred_prob_0.2_rms
The two variables are not identical but that is because one is named and the other is not (as.numeric()
would fix this). It doesn't explain the warning
message though.
identical(test1, test2)
#FALSE
str(test1)
# num [1:200] 0.906 0.9 0.884 0.884 0.886 ...
str(test2)
# Named num [1:200] 0.906 0.9 0.884 0.884 0.886 ...
# - attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...