I have the following code that selects (4 rows of iris x 1000) *100 and calculates the bias of each column.
library(SimDesign)
library(data.table)
do.call(rbind,lapply(1:100, function(x) {
bias(
setDT(copy(iris))[as.vector(sapply(1:1000, function(X) sample(1:nrow(iris),4)))][
, lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
parameter=c(5,3,2,1), #parameter is the true population value used to calculate bias
type='relative' #denotes the type of bias being calculated
)
}))
This takes 1000 samples of 4 rows, calculates the mean by sample #, giving me 1000 means. The bias for the 1000 means is found for each column, and then is done 99 more times giving me a distribution of bias estimates for each column. This is mimicking a random sampling design. However, I also want to do this for a stratified design. So I use splitstackshape
's stratified
function.
do.call(rbind,lapply(1:100, function(x) {
bias(
setDT(copy(iris))[as.vector(sapply(1:1000, function(X) stratified(iris,group="Species", size=1)))][
, lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
parameter=c(5,3,2,1),
type='relative'
)
}))
I would've thought that it is just a matter of swapping out the functions, but I keep on getting errors (i is invalid type (matrix)). Perhaps in future a 2 column matrix could return a list of elements of DT . I think it might be something related to setDT, but I'm not really sure how to fix it. Anybody know where I'm going wrong?
CodePudding user response:
I've split into a couple of functions for you
Load data.table, SimDesign, and splitstackshape
library(SimDesign)
library(data.table)
library(splitstackshape)
Function to get n
stratified samples of size sampsize
and return column means of those samples
get_samples <- function(n, sampsize=4) {
rbindlist(lapply(1:n, function(x) {
splitstackshape::stratified(iris, group="Species",sampsize)[, id:=x]
}))[, lapply(.SD, mean), by=.(Species, id)]
}
Now, lets get the distribution of bias across y
such iterations of these samples
get_bias_distribution <- function(y=100, samples_per_iter=50, size_per_iter=4) {
rbindlist(lapply(1:y, function(y) {
samples = get_samples(samples_per_iter, sampsize=size_per_iter)[, id:=NULL]
samples[, as.list(bias(
estimate=.SD,parameter=c(5,3,2,1),type="relative")*100),
by=.(Species)][, iter:=y]
}))
}
Usage (using defaults)
get_bias_distribution()
Output:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width iter
1: setosa -1.236667 22.61833 -26.70000 -39.69667 1
2: versicolor 46.476667 -11.99500 115.12833 16.82167 1
3: virginica 80.596667 -0.20000 180.21833 53.89000 1
4: setosa -1.513333 20.87000 -27.46167 -38.83667 2
5: versicolor 45.333333 -11.34833 112.84833 17.84500 2
---
296: versicolor 48.250000 -12.26833 113.37000 17.71167 99
297: virginica 77.366667 -2.87000 175.60000 53.07167 99
298: setosa -1.005000 22.67500 -27.02833 -39.69500 100
299: versicolor 47.921667 -10.28333 110.97833 16.86833 100
300: virginica 76.153333 -2.44000 174.46167 52.62167 100
Some comments on what was going wrong above
- When you call
stratified(iris,group="Species", size=1)
, you will get a 3 row data.table, because you are effectively selecting one row at random from each of the three Species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 4.9 3.6 1.4 0.1 setosa
2: 6.3 2.5 4.9 1.5 versicolor
3: 7.7 2.8 6.7 2.0 virginica
- When you wrap this in
sapply(1:1000, function(x)...)
, you get 5 x 1000 column matrix, where each column is contains 5 lists of length 3 .. Below, I'm showing you what this looks like if you didsapply(1:6, function(x)...)
[,1] [,2] [,3] [,4] [,5] [,6]
Sepal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Sepal.Width numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Petal.Length numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Petal.Width numeric,3 numeric,3 numeric,3 numeric,3 numeric,3 numeric,3
Species factor,3 factor,3 factor,3 factor,3 factor,3 factor,3
This is not really what you want, because you cannot then lapply
over these the way you then intended. What you want to do instead is use lapply(1:1000, function(x) ...)
to create a list of such 3-row datatables, and then bind them together (after adding an id
column to each one).