Prevalence Estimates from Observations in data.table Containing Many Binary Classification Columns


I am doing prevalence estimates from my raw data.table by brute force and I need to be more efficient. Can you help?

My data.table contains one weighted observation per row. There are many columns acting as binary dummy variables indicating if the particular observation belongs to one or more of many possible classifications. (e.g., a story could be 'amazing', 'boring', or 'charming', or any combination of the three.)

There has to be a data.table way to replace my for loop. I also suspect that I might not need to generate the queries set at all. I'd appreciate a fresh set of eyes on this problem.

library(data.table)

set.seed(42)
# I have many weighted observations that can be labeled as belonging to one of many categories
# in this example, I simulate 10 observations and only 3 categories
dt = data.table(
        weight = runif( n = 10 , min = 0, max = 1 ),
        a = sample( x = c(0,1) , size = 10 , replace = TRUE ),
        b = sample( x = c(0,1) , size = 10 , replace = TRUE ),
        c = sample( x = c(0,1) , size = 10 , replace = TRUE )
)

# Generate all combinations of categories
queries = as.data.table( expand.grid( rep( list(0:1) , length(names(dt))-1 ) ) )
names(queries) = names(dt)[ 2:length(names(dt)) ] # rename Var1, Var2, Var3 to a, b, c

# Brute force through each possible combination to calculate prevalence
prevalence = rep( NA, nrow(queries) )
for( q in 1:nrow(queries) ){
    prevalence[q] = dt[ a == queries[q, a] & b == queries[q, b] & c == queries[q, c] , sum(weight) ] / dt[ , sum(weight) ]
}

results = copy(queries)
results$prevalence = prevalence

results

The output is:

#   a b c prevalence
#1: 0 0 0 0.09771385
#2: 1 0 0 0.10105192
#3: 0 1 0 0.36229784
#4: 1 1 0 0.00000000
#5: 0 0 1 0.00000000
#6: 1 0 1 0.05993197
#7: 0 1 1 0.00000000
#8: 1 1 1 0.37900443

Update: the original question had 42 simulated observations, and that data covered every possible combination of the categories (a, b, c). The question was revised to include only 10 simulated observations so that some combinations would have no observations (and zero prevalence).
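As a quick way to check which combinations have no observations, an anti-join of the full grid against the data works. This is a sketch with a small hand-built table of my own (not the seeded data above), so the column names match but the values don't:

```r
library(data.table)

# toy data: 4 weighted observations covering only 4 of the 8 (a, b, c) combos
dt = data.table(
    weight = c(0.5, 0.2, 0.2, 0.1),
    a = c(0, 0, 1, 1),
    b = c(0, 1, 1, 1),
    c = c(0, 0, 0, 1)
)

# anti-join: rows of the full grid with no matching observation in dt
missing = CJ(a = 0:1, b = 0:1, c = 0:1)[!dt, on = .(a, b, c)]
missing
```

Here `missing` lists the four unobserved combinations, which are exactly the rows that should report zero prevalence.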

CodePudding user response:

You can calculate it by group:

dt[, .( prevalence = sum(weight) / dt[, sum(weight)] ), by = .(a, b, c)]
  • each group corresponds to one combination of your categories
  • sum the weight of each group, then divide it by the total weight
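A small refinement on the same idea (my own tweak, not part of the answer above): computing the denominator once avoids re-evaluating dt[, sum(weight)] inside j for every group. Note that plain grouping only returns the combinations actually observed in the data:

```r
library(data.table)

# toy data: 4 weighted observations, only 4 of the 8 (a, b, c) combos present
dt = data.table(
    weight = c(0.5, 0.2, 0.2, 0.1),
    a = c(0, 0, 1, 1),
    b = c(0, 1, 1, 1),
    c = c(0, 0, 0, 1)
)

total = dt[, sum(weight)]  # denominator computed once
res = dt[, .(prevalence = sum(weight) / total), by = .(a, b, c)]
res  # only observed combinations appear; unobserved ones are simply absent
```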

CodePudding user response:

Here are some solutions (in both cases, you can replace the keyby argument with by).

If your dataset (dt) already contains all possible combinations of the different categories, then you could do the following (as in @Peace Wang's solution)

dt[, .(prevalence = sum(weight)/sum(dt$weight)), keyby=.(a, b, c)]

#        a     b     c prevalence
# 1:     0     0     0 0.10876301
# 2:     0     0     1 0.02135357
# 3:     0     1     0 0.03775363
# 4:     0     1     1 0.12806864
# 5:     1     0     0 0.18204696
# 6:     1     0     1 0.15197811
# 7:     1     1     0 0.25629705
# 8:     1     1     1 0.11373903

Instead, if the dataset does not contain all possible combinations of the different categories, then you could solve it as follows (CJ(a, b, c, unique=TRUE) builds all combinations of the observed values and removes duplicates; na.rm=TRUE makes combinations with no matching rows sum to 0 rather than NA)

dt[CJ(a, b, c, unique=TRUE), .(prevalence = sum(weight, na.rm=TRUE)/sum(dt$weight)), keyby=.(a, b, c), on=.(a, b, c)]

#        a     b     c prevalence
# 1:     0     0     0 0.10876301
# 2:     0     0     1 0.02135357
# 3:     0     1     0 0.03775363
# 4:     0     1     1 0.12806864
# 5:     1     0     0 0.18204696
# 6:     1     0     1 0.15197811
# 7:     1     1     0 0.25629705
# 8:     1     1     1 0.11373903
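To see how the join behaves when combinations really are missing, here is a minimal check with toy data of my own (two observations, so six of the eight combinations are unmatched); sum(weight, na.rm = TRUE) is what makes unmatched groups come out as 0 rather than NA:

```r
library(data.table)

# two observations only: (0,0,0) and (1,1,1) are the sole observed combos
dt = data.table(
    weight = c(0.6, 0.4),
    a = c(0, 1),
    b = c(0, 1),
    c = c(0, 1)
)

# join against the full grid; na.rm = TRUE turns unmatched groups into 0
res = dt[CJ(a, b, c, unique = TRUE),
         .(prevalence = sum(weight, na.rm = TRUE) / sum(dt$weight)),
         keyby = .(a, b, c), on = .(a, b, c)]
res
```

All eight combinations appear in the result, with the six unmatched ones reporting a prevalence of exactly 0.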