I am working with the R programming language. Suppose I have the following data:
library("dplyr")
df <- data.frame(b = rnorm(100,5,5), d = rnorm(100,2,2),
c = rnorm(100,10,10))
a <- c("a", "b", "c", "d", "e")
a <- sample(a, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))
a <- as.factor(a)
df$a <- a
> head(df)
b d c a
1 3.1316480 0.5032860 4.7362991 a
2 4.3111450 -0.1142736 -0.5841322 c
3 2.8291346 3.6107839 16.0684492 a
4 14.2142245 4.9893987 -1.8145138 a
5 -6.7381302 0.0416782 -7.7675387 c
6 0.4481874 0.3370716 17.4260801 a
I also have the following function ("my_subset_mean"), which computes the mean of column "c" over the rows selected by a given choice of inputs:
my_subset_mean <- function(r1, r2, r3){
subset <- df %>% filter(a %in% r1, b > r2, d < r3)
return(mean(subset$c))
}
my_subset_mean(r1 = c("a", "b"), r2 = 5, r3 = 1 )
[1] 5.682513
My Question: I am trying to evaluate the function "my_subset_mean" at random combinations of "r1", "r2" and "r3". For example:
my_subset_mean(r1 = c("a", "b"), r2 = 5, r3 = 1 )
[1] 11.46365
my_subset_mean(r1 = c("a"), r2 = 2, r3 = 0 )
[1] 14.59809
my_subset_mean(r1 = c("a", "b", "c"), r2 = 3.1, r3 = 0 )
[1] 11.26508
#I am not sure how to get this one to work (i.e. ignore "r1" altogether and only calculate the mean using r2 and r3)
my_subset_mean(r1 = "NA", r2 = 3.1, r3 = 0 )
[1] NaN
etc.
Is it possible to make a "grid" that contains random values of "r2" and "r3" (e.g. random values of "r2" and "r3" between 0 and 5) along with random subsets of "r1" (e.g. "a", "c, d", "b, a, e", "d"):
> head(my_grid)
r2 r3 r1
1 3.1316480 0.5032860 a, b
2 4.3111450 -0.1142736 c, d, e
3 2.8291346 3.6107839 a
4 14.2142245 4.9893987 b, e
5 -6.7381302 0.0416782 NA
6 0.4481874 0.3370716 e
And then evaluate "my_subset_mean" at each row of "my_grid"? E.g.
#desired result
> head(final_answer)
r2 r3 r1 my_subset_mean
1 3.1316480 0.5032860 a, b 0.3
2 4.3111450 -0.1142736 c, d, e 0.1
3 2.8291346 3.6107839 a 0.55
4 14.2142245 4.9893987 b, e 0.6
5 -6.7381302 0.0416782 NA 0.51
6 0.4481874 0.3370716 e 0.16
If there were no "factor variables" involved, I think I could have done this with an iterative "for loop". But I am not sure how to "feed" the function ("my_subset_mean") using "my_grid". Can someone please show me how to do this?
Thanks!
CodePudding user response:
I think this code might help you:
library(tidyverse)
r1_sim <- c("a", "b", "c", "d", "e")
r2_sim <- seq(0,1,.2)
r3_sim <- seq(0,1,.2)
expand_grid(r1 = r1_sim,r2 = r2_sim, r3 = r3_sim) %>%
rowwise() %>%
mutate(my_subset_mean(r1,r2,r3))
# A tibble: 180 x 4
# Rowwise:
r1 r2 r3 `my_subset_mean(r1, r2, r3)`
<chr> <dbl> <dbl> <dbl>
1 a 0 0 16.5
2 a 0 0.2 12.9
3 a 0 0.4 12.9
4 a 0 0.6 12.9
5 a 0 0.8 12.9
6 a 0 1 13.4
7 a 0.2 0 16.5
8 a 0.2 0.2 12.9
9 a 0.2 0.4 12.9
10 a 0.2 0.6 12.9
# ... with 170 more rows
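The grid above is regular and holds a single level of r1 per row. A minimal sketch of the random grid asked for in the question, assuming r1 is stored as a list-column so that each row can carry a subset of levels. It uses base R only; the seed, the grid size n, and the base R rewrite of my_subset_mean are assumptions made for illustration, not part of the original code.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Data with the same structure as in the question
df <- data.frame(b = rnorm(100, 5, 5), d = rnorm(100, 2, 2),
                 c = rnorm(100, 10, 10))
df$a <- factor(sample(c("a", "b", "c", "d", "e"), 100, replace = TRUE,
                      prob = c(0.3, 0.2, 0.3, 0.1, 0.1)))

# Base R stand-in for the question's my_subset_mean()
my_subset_mean <- function(r1, r2, r3) {
  keep <- df$a %in% r1 & df$b > r2 & df$d < r3
  mean(df$c[keep])  # NaN when no row matches
}

n  <- 10              # hypothetical number of grid rows
lv <- levels(df$a)

# r2 and r3: random values between 0 and 5
my_grid <- data.frame(r2 = runif(n, 0, 5), r3 = runif(n, 0, 5))

# r1: one random, non-empty subset of the levels per row (list-column)
my_grid$r1 <- lapply(seq_len(n), function(i) sample(lv, sample(length(lv), 1)))

# Evaluate the function once per row of the grid
my_grid$my_subset_mean <- mapply(my_subset_mean,
                                 my_grid$r1, my_grid$r2, my_grid$r3)

head(my_grid)
```

Rows whose filter matches nothing come back as NaN, because mean() of an empty vector is NaN. If a printable column is preferred over a list-column, sapply(my_grid$r1, toString) gives strings such as "a, c".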
CodePudding user response:
You can write a function that selects random values for r1, r2 and r3 based on the data that you have. runif will help you draw a random number within a range.
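create_output <- function() {
  uv <- levels(df$a)
  # Random non-empty subset of the factor levels: draw the subset size first
  r1 <- sample(uv, sample(length(uv), 1))
  # Draw r2 and r3 uniformly within the observed ranges of b and d
  rgb <- range(df$b)
  rgd <- range(df$d)
  r2 <- runif(1, rgb[1], rgb[2])
  r3 <- runif(1, rgd[1], rgd[2])
  # Store the result under a name that does not shadow the function
  result <- my_subset_mean(r1, r2, r3)
  data.frame(r1 = toString(r1), r2, r3, my_subset_mean = result)
}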
Run it once:
create_output()
# r1 r2 r3 my_subset_mean
#1 d, c, e, a -0.5762248 -0.3233672 0.3470009
Run it 100 times and bind the results:
out <- do.call(rbind, replicate(100, create_output(), simplify = FALSE))
head(out)
# r1 r2 r3 my_subset_mean
#1 e, d -6.870120 4.9283288 12.604477
#2 d, c, b, e 13.730295 4.0619485 7.749107
#3 e -4.990023 5.4652763 13.441422
#4 c, a 2.095414 5.4337308 10.603865
#5 d, c, b, e -6.614294 -0.4182057 6.703294
#6 a, c, d, b, e 17.369292 3.9566795 7.749107
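The question also asks how to ignore r1 altogether. One possible approach, sketched below in base R, is to skip the filter on a whenever r1 is NULL or NA. The function name my_subset_mean_any, its data argument, and the toy data frame are hypothetical and only serve the illustration.

```r
# Hypothetical variant: r1 = NULL or NA means "do not filter on a"
my_subset_mean_any <- function(r1 = NULL, r2, r3, data) {
  keep <- data$b > r2 & data$d < r3
  if (!is.null(r1) && !all(is.na(r1))) {
    keep <- keep & data$a %in% r1
  }
  mean(data$c[keep])
}

# Small toy data set (assumption, chosen so the results are easy to check)
toy <- data.frame(a = factor(c("a", "b", "c", "a")),
                  b = c(6, 6, 6, 1),
                  d = c(0, 0, 0, 0),
                  c = c(1, 2, 3, 100))

my_subset_mean_any(NULL, 5, 1, toy)         # 2: all rows with b > 5 and d < 1
my_subset_mean_any(NA, 5, 1, toy)           # 2: NA is treated like NULL
my_subset_mean_any(c("a", "b"), 5, 1, toy)  # 1.5: additionally restricted to a, b
```

Note that the string "NA" used in the question is an ordinary character value, so it is matched against the levels of a and returns NaN when no such level exists. Passing NULL or a real NA avoids that.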