R: Select vector (numeric) from data frame, sample n=10 subsets of size i=5 and i= 10 within vector

Time:04-29

I have the following problem:

  1. I have a data frame with 20 rows and 2 columns: "Name" (text) and "Values" (numeric).
  2. I want to extract "Values" and, 10 times, randomly sample (with equal weight) a subset of size 5 and calculate its mean. I want to capture those 10 means in another vector (10x1).
  3. I want to do the same as step 2, but with a larger subset of 15 of the 20 values: take the 15 values, calculate the mean, repeat this 10 times, and store the results in a new 10x1 vector.
  4. Ultimately, I want to compare some descriptive statistics between these two vectors, expecting that the smaller-subset vector would have fatter tails, be more negatively skewed, etc.

Creating the data frame as a start

Name <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t")
Values <- c(0.1, 0.05, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06, 0.04, 0.15, 0.13, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, 0.03)
data <- data.frame(Name, Values)

The relevant part:

# extract Values column
Values <- data$Values

# define sizes of subset and number of iterations
n_small <- 5
n_large <- 15
n_iterations <- 10

set.seed(123456)

# Initialize result vector
Averages_small <- NULL
Averages_large <- NULL

# Calculate average of the subset and allocate it to the result vector
for (i in n_iterations) {
  Averages_small[i] <- mean(sample(Values, n_small, replace = FALSE))
  Averages_large[i] <- mean(sample(Values, n_large, replace = FALSE))
}

Somehow this gives me 9x NA and one number. What am I doing wrong? And is there a better way than a for loop? The above is only an example without NA values; the original data set has 20k rows and might contain missing values.

FYI, as background: the Values are investment returns, and the question is whether holding a higher number of investments helps diversification.

Thank you very much for your help!

CodePudding user response:

You can use replicate to get 10 draws of your sample. This returns a matrix with the samples in columns, so the colMeans of this matrix gives you the vector you are looking for:

set.seed(1) # For reproducibility

vec5  <- colMeans(replicate(10, sample(data$Values, 5)))
vec15 <- colMeans(replicate(10, sample(data$Values, 15)))

vec5
#> [1] -0.014  0.148  0.044 -0.026  0.062  0.020 -0.032 -0.130  0.166  0.040

vec15
#> [1]  0.058000000  0.024666667  0.051333333  0.045333333  0.024000000
#> [6]  0.010666667  0.022666667 -0.010000000  0.003333333 -0.001333333

You can see that the standard deviation of vec5 is indeed larger:

sd(vec5)
#> [1] 0.08711908

sd(vec15)
#> [1] 0.02297406
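
The question also mentions that the real data may contain missing values. A minimal sketch of two ways to handle that with the same replicate approach follows. The vector values_with_na is a hypothetical variant of the example data with two entries replaced by NA; it is not from the original data set.

```r
set.seed(1) # For reproducibility

# Hypothetical data: the example Values with two entries set to NA
values_with_na <- c(0.1, NA, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06,
                    0.04, 0.15, NA, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, 0.03)

# Option A: drop missing values first, then sample from the clean vector
clean <- values_with_na[!is.na(values_with_na)]
vec5_clean <- colMeans(replicate(10, sample(clean, 5)))

# Option B: sample from the full vector and ignore NAs when averaging
vec5_narm <- colMeans(replicate(10, sample(values_with_na, 5)), na.rm = TRUE)
```

The two options are not equivalent: with Option B a draw that happens to include an NA is averaged over fewer than 5 values, so Option A is probably the better fit if every mean should be based on the same subset size.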

CodePudding user response:

I know that this question has already been answered, but I have found the mistake in your original code that caused it to not work.
The code as you wrote it can actually work as you want it to, but the for loop only fired once; for (i in v) loops over a vector, repeating with each value listed. Remember that you set

n_iterations <- 10

So in your loop, you effectively had for (i in 10), such that the loop was only called once, meaning that the whole structure ended up being

Averages_small[10] <- mean(sample(Values, n_small, replace = FALSE))
Averages_large[10] <- mean(sample(Values, n_large, replace = FALSE))
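
This also explains the nine NA values: assigning to position 10 of a vector that starts as NULL pads positions 1 to 9 with NA. A small stand-alone reproduction, using the illustrative names result and count (not from the original code):

```r
n_iterations <- 10
result <- NULL   # same initialisation as in the question
count <- 0       # counts how often the loop body runs

for (i in n_iterations) {   # iterates over the length-one vector c(10)
  count <- count + 1
  result[i] <- 1            # only position 10 is assigned
}

count
#> [1] 1
result
#>  [1] NA NA NA NA NA NA NA NA NA  1
```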

What you want is for (i in 1:10), which loops over the vector 1, 2, ..., 10. This can be solved either by defining n_iterations <- 1:10, or (using your original setup)

set.seed(123456)
for (i in 1:n_iterations) {
  Averages_small[i] <- mean(sample(Values, n_small, replace = FALSE))
  Averages_large[i] <- mean(sample(Values, n_large, replace = FALSE))
}
Averages_small
#> [1] -0.066  0.042  0.036  0.018  0.080  0.016 -0.038 -0.180  0.132  0.042
Averages_large
#> [1] -0.02600000 -0.01266667  0.02000000  0.04666667  0.03533333 -0.02200000 -0.01533333 -0.00400000  0.03266667  0.07333333

I know that for loops are generally not optimal, and a solution that does not rely on one is probably superior, but I also thought that you would appreciate an explanation of why your code did not function correctly in the first place.
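
If you keep the loop, one possible refinement (a sketch, not the only way to write it) is to preallocate the result vectors with numeric() and iterate with seq_len(). Unlike 1:n, seq_len(n) yields an empty sequence when n is 0, so the loop body is skipped instead of running for i = 1 and i = 0.

```r
# Example data and settings copied from the question
Values <- c(0.1, 0.05, 0.03, 0.06, -0.1, -0.3, -0.05, 0.5, 0.12, 0.06,
            0.04, 0.15, 0.13, 0.16, -0.12, -0.03, -0.5, 0.05, 0.07, 0.03)
n_small <- 5
n_large <- 15
n_iterations <- 10

set.seed(123456)

# Preallocate result vectors of the final length instead of growing from NULL
Averages_small <- numeric(n_iterations)
Averages_large <- numeric(n_iterations)

# seq_len(n_iterations) is 1, 2, ..., n_iterations (empty if n_iterations is 0)
for (i in seq_len(n_iterations)) {
  Averages_small[i] <- mean(sample(Values, n_small))
  Averages_large[i] <- mean(sample(Values, n_large))
}
```

Preallocating matters little for 10 iterations, but it avoids repeatedly growing the vectors if the number of iterations becomes large.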
