data.table sample with probabilities stored in columns-CodePudding

I have a data table with probabilities for a discrete distribution stored in columns.

For example, dt <- data.table(p1 = c(0.5, 0.25, 0.1), p2 = c(0.25, 0.5, 0.1), p3 = c(0.25, 0.25, 0.8))

I'd like to create a new column of a random variable sampled using the probabilities in the same row. In data.table syntax I imagine it working like this:

dt[, sample := sample(1:3, 1, prob = c(p1, p2, p3))]

If there were a 'psample' function similar to 'pmin' and 'pmax' this would work. I was able to make this work using apply, the downside is that with my real data set this takes longer than I would like. Is there a way to make this work using data.table? The apply solution is given below.

dt[, sample := apply(dt, 1, function(x) sample(1:3, 1, prob = x[c('p1', 'p2', 'p3')]))]

CodePudding user response：

If you are choosing from 1:n you could use sampl.int which is faster. Also applying on a matrix is faster. Putting both in a function psamp is even faster.

apply(dt, 1, function(x) sample.int(n=3, size=1, prob=x[c('p1', 'p2', 'p3')]))
apply(as.matrix(dt), 1, function(x) sample.int(n=3, size=1, prob=x[c('p1', 'p2', 'p3')]))
apply(as.matrix(dt), 1, psamp)

So, try this (I added dt[, 1:3] so that it won't fail once the column is added):

psamp <- function(x) sample.int(n=3, size=1, prob=x)
dt[, sample :=apply(as.matrix(dt[, 1:3]), 1, psamp)]

Benchmark, 10k repetitions each:

Unit: seconds
           expr      min       lq     mean   median       uq      max neval  cld
         sample 1.361271 1.367118 1.372092 1.372965 1.377503 1.382042     3    d
     sample.int 1.293450 1.303280 1.306730 1.313110 1.313370 1.313630     3   c 
 sample.int.mat 1.267873 1.268889 1.271717 1.269906 1.273639 1.277373     3  b  
          psamp 1.221795 1.223325 1.224239 1.224856 1.225461 1.226065     3 a

CodePudding user response：

I think the proper data.table approach would be to use the .SD facilities:

dt2 <- rbind(dt,dt,dt,dt)
psamp <- function(x) sample.int(n=3, size=1, prob=x) # from jay.sf

dt2[, sample :=apply(.SD, 1, psamp), .SDcols=c('p1','p2','p3')]
> dt2
      p1   p2   p3 sample
 1: 0.50 0.25 0.25      2
 2: 0.25 0.50 0.25      1
 3: 0.10 0.10 0.80      2
 4: 0.50 0.25 0.25      3
 5: 0.25 0.50 0.25      2
 6: 0.10 0.10 0.80      3
 7: 0.50 0.25 0.25      3
 8: 0.25 0.50 0.25      2
 9: 0.10 0.10 0.80      3
10: 0.50 0.25 0.25      1
11: 0.25 0.50 0.25      2
12: 0.10 0.10 0.80      3

A note on style: it's better to refrain from naming R objects with strings that are also names of R functions, such as df (density of the F distribution), dt (density of the t-distribution), data (method to load a canned dataset).

CodePudding user response：

I think this is also an option, and might be quite fast? Perhaps @jay.sf can let us know, as it also uses psamp (thanks @jay.sf)

set(dt,j="sample",value=apply(dt,1,psamp))