I am computing prevalence estimates from my raw data.table by brute force, and I need to be more efficient. Can you help?
My data.table contains one weighted observation per row. Many columns act as binary dummy variables indicating whether the observation belongs to one or more of many possible classifications (e.g., a story could be 'amazing', 'boring', or 'charming', or any combination of the three).
There's got to be a data.table way to replace my for loop. I also suspect that I don't necessarily need to generate the queries set. I'd appreciate a fresh set of eyes on this problem.
library(data.table)
set.seed(42)
# I have many weighted observations that can each belong to one or more of many categories
# in this example, I simulate 10 observations and only 3 categories
dt = data.table(
  weight = runif(n = 10, min = 0, max = 1),
  a = sample(x = c(0, 1), size = 10, replace = TRUE),
  b = sample(x = c(0, 1), size = 10, replace = TRUE),
  c = sample(x = c(0, 1), size = 10, replace = TRUE)
)
# Generate all combinations of categories
queries = as.data.table(expand.grid(rep(list(0:1), length(names(dt)) - 1)))
names(queries) = names(dt)[-1] # rename Var1, Var2, Var3 to a, b, c
# Brute force through each possible combination to calculate prevalence
prevalence = rep(NA_real_, nrow(queries))
for (q in 1:nrow(queries)) {
  prevalence[q] = dt[a == queries[q, a] & b == queries[q, b] & c == queries[q, c], sum(weight)] / dt[, sum(weight)]
}
results = copy(queries)
results$prevalence = prevalence
results
The output is:
# a b c prevalence
#1: 0 0 0 0.09771385
#2: 1 0 0 0.10105192
#3: 0 1 0 0.36229784
#4: 1 1 0 0.00000000
#5: 0 0 1 0.00000000
#6: 1 0 1 0.05993197
#7: 0 1 1 0.00000000
#8: 1 1 1 0.37900443
Update: the original question had 42 simulated observations, and the data covered every possible combination of the categories (a, b, c). The question was revised to include only 10 simulated observations so that some combinations would have no observations (and zero prevalence).
CodePudding user response:
You can calculate it by group:
dt[, .(prevalence = sum(weight) / dt[, sum(weight)]), by = .(a, b, c)]
- each group corresponds to one of your category combinations
- sum the weight of each group, then divide it by the total weight
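One refinement, assuming this runs on large data: the inner dt[, sum(weight)] above is part of j, so it is re-evaluated for every group. A minimal sketch that computes the grand total once and reuses it (total is a variable name introduced here for illustration):
total = dt[, sum(weight)]  # grand total of all weights, computed once
dt[, .(prevalence = sum(weight) / total), by = .(a, b, c)]
Note that grouping only returns category combinations that actually occur in the data; the CJ() approach in the next answer fills in the missing ones.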
CodePudding user response:
Here are some solutions (in both cases, you can replace the keyby argument with by).
If your dataset (dt) already contains all possible combinations of the different categories, then you could do (as in @Peace Wang's solution):
dt[, .(prevalence = sum(weight) / sum(dt$weight)), keyby = .(a, b, c)]
# a b c prevalence
# 1: 0 0 0 0.10876301
# 2: 0 0 1 0.02135357
# 3: 0 1 0 0.03775363
# 4: 0 1 1 0.12806864
# 5: 1 0 0 0.18204696
# 6: 1 0 1 0.15197811
# 7: 1 1 0 0.25629705
# 8: 1 1 1 0.11373903
Instead, if the dataset does not contain all possible combinations of the different categories, then you could solve it as follows (CJ(a, b, c, unique=TRUE) computes all combinations of the observed values and removes duplicates; sum(weight, na.rm=TRUE) makes empty combinations count as 0 rather than NA):
dt[CJ(a, b, c, unique = TRUE), .(prevalence = sum(weight, na.rm = TRUE) / sum(dt$weight)), keyby = .(a, b, c), on = .(a, b, c)]
# a b c prevalence
# 1: 0 0 0 0.10876301
# 2: 0 0 1 0.02135357
# 3: 0 1 0 0.03775363
# 4: 0 1 1 0.12806864
# 5: 1 0 0 0.18204696
# 6: 1 0 1 0.15197811
# 7: 1 1 0 0.25629705
# 8: 1 1 1 0.11373903
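Since the real data has many more dummy columns than a, b, and c, here is a sketch of a fully generic version, under the assumption that every column of dt except weight is a 0/1 dummy (cols and full are names introduced here for illustration):
cols = setdiff(names(dt), "weight")                       # all dummy columns
full = do.call(CJ, c(dt[, ..cols], list(unique = TRUE))) # all combinations of observed values
dt[full, .(prevalence = sum(weight, na.rm = TRUE) / sum(dt$weight)), keyby = cols, on = cols]
This also removes the need to build the queries table with expand.grid, as suspected in the question.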