I have a data table with probabilities for a discrete distribution stored in columns.
For example, dt <- data.table(p1 = c(0.5, 0.25, 0.1), p2 = c(0.25, 0.5, 0.1), p3 = c(0.25, 0.25, 0.8))
I'd like to create a new column of a random variable sampled using the probabilities in the same row. In data.table syntax I imagine it working like this:
dt[, sample := sample(1:3, 1, prob = c(p1, p2, p3))]
If there were a 'psample' function similar to 'pmin' and 'pmax' this would work. I was able to make this work using apply, the downside is that with my real data set this takes longer than I would like. Is there a way to make this work using data.table? The apply solution is given below.
dt[, sample := apply(dt, 1, function(x) sample(1:3, 1, prob = x[c('p1', 'p2', 'p3')]))]
CodePudding user response:
If you are choosing from 1:n
you could use sampl.int
which is faster. Also apply
ing on a matrix
is faster. Putting both in a function psamp
is even faster.
apply(dt, 1, function(x) sample.int(n=3, size=1, prob=x[c('p1', 'p2', 'p3')]))
apply(as.matrix(dt), 1, function(x) sample.int(n=3, size=1, prob=x[c('p1', 'p2', 'p3')]))
apply(as.matrix(dt), 1, psamp)
So, try this (I added dt[, 1:3]
so that it won't fail once the column is added):
psamp <- function(x) sample.int(n=3, size=1, prob=x)
dt[, sample :=apply(as.matrix(dt[, 1:3]), 1, psamp)]
Benchmark, 10k repetitions each:
Unit: seconds
expr min lq mean median uq max neval cld
sample 1.361271 1.367118 1.372092 1.372965 1.377503 1.382042 3 d
sample.int 1.293450 1.303280 1.306730 1.313110 1.313370 1.313630 3 c
sample.int.mat 1.267873 1.268889 1.271717 1.269906 1.273639 1.277373 3 b
psamp 1.221795 1.223325 1.224239 1.224856 1.225461 1.226065 3 a
CodePudding user response:
I think the proper data.table approach would be to use the .SD facilities:
dt2 <- rbind(dt,dt,dt,dt)
psamp <- function(x) sample.int(n=3, size=1, prob=x) # from jay.sf
dt2[, sample :=apply(.SD, 1, psamp), .SDcols=c('p1','p2','p3')]
> dt2
p1 p2 p3 sample
1: 0.50 0.25 0.25 2
2: 0.25 0.50 0.25 1
3: 0.10 0.10 0.80 2
4: 0.50 0.25 0.25 3
5: 0.25 0.50 0.25 2
6: 0.10 0.10 0.80 3
7: 0.50 0.25 0.25 3
8: 0.25 0.50 0.25 2
9: 0.10 0.10 0.80 3
10: 0.50 0.25 0.25 1
11: 0.25 0.50 0.25 2
12: 0.10 0.10 0.80 3
A note on style: it's better to refrain from naming R objects with strings that are also names of R functions, such as df
(density of the F distribution), dt
(density of the t-distribution), data
(method to load a canned dataset).
CodePudding user response:
I think this is also an option, and might be quite fast? Perhaps @jay.sf can let us know, as it also uses psamp
(thanks @jay.sf)
set(dt,j="sample",value=apply(dt,1,psamp))