Good day everyone
Take this dataset:
library(dplyr)  # provides if_else() and case_when(), and re-exports tibble()

df <- tibble(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), length(id), prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(length(id), mean = 60, sd = 10)
)
I want to add another logical variable lung_cancer to the dataframe, where TRUE or FALSE is assigned through a probability distribution that is calculated from the patient's smoking and age status.
I understand that this requires looping over each index, and I can manage to do it with a for() loop, so I wrote the following:
# nrow(df) rather than length(id): id only exists as a column inside df
df$lung_cancer <- vector("logical", nrow(df))
for (i in seq_along(df$lung_cancer)) {
  df$lung_cancer[[i]] <- if_else(
    df$age[[i]] > 50,
    case_when(
      df$age[[i]] > 50 & df$smoking[[i]] == TRUE ~ sample(c(TRUE, FALSE), 1, prob = c(0.05, 0.95)),
      df$age[[i]] > 50 & df$smoking[[i]] == FALSE ~ sample(c(TRUE, FALSE), 1, prob = c(0.001, 0.999))
    ),
    FALSE
  )
}
Now I find this to be too verbose. Is there a more concise way to write this with the mutate() function and the purrr package, or any other way (preferably from the tidyverse package collection)?
CodePudding user response:
The case_when() function should be all you need, but it does not seem to re-evaluate sample() for each TRUE event.
Here is a simple base R solution taking advantage of R's vectorization ability (thus avoiding the loop).
#set all to the default value
df$lung_cancer<-FALSE
#perform the selections and then set to new value
df$lung_cancer[df$age > 50 & df$smoking == TRUE ] <- sample(c(TRUE, FALSE), nrow(df), prob = c(0.05, 0.95), replace = TRUE)[df$age > 50 & df$smoking == TRUE ]
df$lung_cancer[df$age > 50 & df$smoking == FALSE] <- sample(c(TRUE, FALSE), nrow(df), prob = c(0.001, 0.999), replace = TRUE)[df$age > 50 & df$smoking == FALSE]
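The two assignments above draw nrow(df) values each and then throw most of them away. A minimal variant, assuming the same data and cut-offs as in the question, draws only as many values as there are matching rows. The mask names older_smoker and older_nonsmoker are made up for this sketch, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Same shape as the question's data, built with base R only
n <- 1000
df <- data.frame(
  id = seq_len(n),
  smoking = sample(c(TRUE, FALSE), n, prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(n, mean = 60, sd = 10)
)

# Hypothetical helper masks, computed once and reused
older_smoker    <- df$age > 50 & df$smoking
older_nonsmoker <- df$age > 50 & !df$smoking

# Default for everyone, then one draw per matching row
df$lung_cancer <- FALSE
df$lung_cancer[older_smoker] <- sample(
  c(TRUE, FALSE), sum(older_smoker), prob = c(0.05, 0.95), replace = TRUE
)
df$lung_cancer[older_nonsmoker] <- sample(
  c(TRUE, FALSE), sum(older_nonsmoker), prob = c(0.001, 0.999), replace = TRUE
)
```

The result follows the same distribution as the version above; it just avoids generating values that are never used.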
case_when() question
To define a default value with case_when(), make the last test a bare TRUE condition, as in this example:
case_when(
  df$age > 50 & df$smoking == TRUE ~ "Group1",
  df$age > 50 & df$smoking == FALSE ~ "Group2",
  TRUE ~ "Everyone Else"
)
See ?case_when for more examples.
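Since the question asked for a mutate() version, here is one possible sketch that combines the default-value pattern with a per-row draw. Rather than calling sample() inside case_when(), it uses case_when() to pick a probability for each row and then compares one runif() draw per row against it. The p_cancer column is a hypothetical helper that is dropped at the end, and the seed is arbitrary; this is one way to do it under those assumptions, not necessarily the only tidyverse approach.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

df <- tibble(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), length(id), prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(length(id), mean = 60, sd = 10)
)

df <- df %>%
  mutate(
    # p_cancer: hypothetical helper column, per-row probability of TRUE
    p_cancer = case_when(
      age > 50 & smoking  ~ 0.05,
      age > 50 & !smoking ~ 0.001,
      TRUE                ~ 0
    ),
    # one uniform draw per row; TRUE with probability p_cancer
    lung_cancer = runif(n()) < p_cancer
  ) %>%
  select(-p_cancer)
```

Because runif(n()) produces a full-length vector, every row gets its own draw, which sidesteps the re-evaluation problem with a length-1 sample().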
CodePudding user response:
data.table allows you to mutate a portion of a column. This way sample() is called only twice instead of 1000 times.
library(data.table)
library(magrittr)  # provides %>%, which data.table does not export

set.seed(42)
df <- data.table(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), length(id), prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(length(id), mean = 60, sd = 10)
) %>%
  # default value for every row
  .[, lung_cancer := FALSE] %>%
  # overwrite only the matching rows; .N is the number of rows in the subset
  .[age > 50 & smoking, lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.05, 0.95), replace = TRUE)] %>%
  .[age > 50 & !smoking, lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.001, 0.999), replace = TRUE)] %>%
  .[]
df[, .(.N, lc = sum(lung_cancer)), keyby = smoking]
#>    smoking   N lc
#> 1:   FALSE 804  2
#> 2:    TRUE 196  5
I put a "report" at the end.
(You can convert your tibble to a data.table with setDT() instead, if necessary.)
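For the setDT() route, a rough sketch might look like the following. It assumes a plain data.frame standing in for the tibble from the question (so the example only needs data.table), and it writes each step as a separate statement instead of a pipe.

```r
library(data.table)

set.seed(42)

# A plain data.frame standing in for the tibble from the question
df <- data.frame(
  id = 1:1000,
  smoking = sample(c(TRUE, FALSE), 1000, prob = c(0.2, 0.8), replace = TRUE),
  age = rnorm(1000, mean = 60, sd = 10)
)

# Convert in place: no copy is made and no reassignment is needed
setDT(df)

# Default for every row, then overwrite only the matching subsets
df[, lung_cancer := FALSE]
df[age > 50 & smoking,
   lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.05, 0.95), replace = TRUE)]
df[age > 50 & !smoking,
   lung_cancer := sample(c(TRUE, FALSE), .N, prob = c(0.001, 0.999), replace = TRUE)]

# Same kind of "report" as above: row count and cases per smoking group
report <- df[, .(.N, lc = sum(lung_cancer)), keyby = smoking]
```

Because := modifies df by reference, the intermediate results do not need to be assigned back to df.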