I am working with the R programming language.
I have the following data:
set.seed(123)
var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T)
score = rnorm(10000,10,5)
my_data = data.frame(var1,var2, var3,var4, score)
We can see the summary of unique rows for this data with the following command:
# https://stackoverflow.com/questions/34312324/r-count-all-combinations
> library(data.table)
> dt = my_data[, c(1, 2, 3, 4)]
> setDT(dt)[, list(Count = .N), names(dt)]
var1 var2 var3 var4 Count
1: 0 0 0 0 667
2: 0 1 0 0 601
3: 1 1 1 1 651
4: 0 1 1 1 608
5: 1 0 1 1 613
6: 1 1 0 1 588
7: 0 1 1 0 607
8: 0 0 1 1 607
9: 1 0 1 0 625
10: 0 1 0 1 661
11: 1 1 1 0 635
12: 0 0 1 0 640
13: 1 1 0 0 608
14: 1 0 0 0 607
15: 0 0 0 1 626
16: 1 0 0 1 656
I want to find the difference in the average value of "score" when some variable is "present" versus when that same variable is "absent". For example:
Contribution for Var4 : Average score for (var1 = 1, var2= 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2= 1, var3 = 1, var4 = 0)
Contribution for Var2 : Average score for (var1 = 1, var2= 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2= 0, var3 = 1, var4 = 1)
etc.
I found a very "clumsy" way to do this:
var1_present <- my_data[which(my_data$var1 == 1 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]
var1_present_score = mean(var1_present$score)
var1_absent <- my_data[which(my_data$var1 == 0 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]
var1_absent_score = mean(var1_absent$score)
var_1_contribution = var1_present_score - var1_absent_score
var_1_contribution
[1] 0.1288283
Is there some way to write a function that can look at the "contribution" of different variables to the "score"? I understand that even for 4 variables there can be many different combinations to compare - e.g. row 14 vs. row 16: (1,0,0,0) vs. (1,0,0,1). But even for just some "contributions", is it possible to write a function that evaluates the "contribution" of variables being absent/present?
Can someone please show me how to do this?
Thanks!
CodePudding user response:
If those "contribution" are only considering something from 0 to 1 and other variables are all 1, you may try
library(dplyr)

my_data %>%
  mutate(key = var1 + var2 + var3 + var4) %>%
  summarize(across(var1:var4,
                   ~ mean(score[.x == 1 & key == 4]) - mean(score[.x == 0 & key == 3])))
var1 var2 var3 var4
1 0.1288283 0.07935381 0.1098573 -0.09315946
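If you prefer a reusable base-R function for the same computation, here is one possible sketch. The helper name `contribution` and its argument names are my own, not from any package; it assumes `my_data` as built in the question.

```r
# Sketch of a reusable helper: for variable v, compare the mean score
# where v and all the other variables are 1 against the mean score
# where v alone is 0 and the remaining variables are 1.
contribution <- function(v, data,
                         vars = c("var1", "var2", "var3", "var4"),
                         score = "score") {
  others <- setdiff(vars, v)
  all_on <- rowSums(data[others]) == length(others)  # every other var is 1
  mean(data[[score]][all_on & data[[v]] == 1]) -
    mean(data[[score]][all_on & data[[v]] == 0])
}

# Contribution of each variable in turn:
sapply(c("var1", "var2", "var3", "var4"), contribution, data = my_data)
```

This reproduces your `var_1_contribution` value for var1 and gives the other three in one call.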
CodePudding user response:
You appear to have a 2^k factorial design experiment, so named because each variable is either "on" or "off" creating 2^4 = 16 conditions. You can test for main effects and interaction effects of combinations of variables with lm.
mod1 <- lm(score ~ var1*var2*var3*var4, data = my_data)
summary(mod1)
Call:
lm(formula = score ~ var1 * var2 * var3 * var4, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-18.4738 -3.3164 0.0109 3.3905 18.2905
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.99438 0.20041 49.870 <2e-16 ***
var1 0.16979 0.27420 0.619 0.536
var2 0.28658 0.28495 1.006 0.315
var3 0.15310 0.28074 0.545 0.586
var4 -0.14692 0.28365 -0.518 0.604
var1:var2 -0.02208 0.39440 -0.056 0.955
var1:var3 -0.37453 0.39302 -0.953 0.341
var2:var3 -0.31516 0.40026 -0.787 0.431
var1:var4 -0.24531 0.39354 -0.623 0.533
var2:var4 0.14085 0.40357 0.349 0.727
var3:var4 0.44781 0.39681 1.129 0.259
var1:var2:var3 -0.02015 0.56110 -0.036 0.971
var1:var2:var4 -0.33879 0.56254 -0.602 0.547
var1:var3:var4 0.07728 0.56006 0.138 0.890
var2:var3:var4 -0.83680 0.56508 -1.481 0.139
var1:var2:var3:var4 1.04485 0.79458 1.315 0.189
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.962 on 9984 degrees of freedom
Multiple R-squared: 0.001903, Adjusted R-squared: 0.0004037
F-statistic: 1.269 on 15 and 9984 DF, p-value: 0.2123
The effects represent the contribution of each variable. Typically, you would remove the insignificant terms and refit a model with the remaining variables to assess the significant ones.
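As a sketch of that refinement (which terms to drop depends on your actual results; here I keep main effects only, purely for illustration), you can refit and compare the two models with an F-test:

```r
# Illustrative follow-up: refit with main effects only, then check via
# anova() whether the full interaction model (mod1) fits significantly
# better than the reduced model.
mod2 <- lm(score ~ var1 + var2 + var3 + var4, data = my_data)
summary(mod2)
anova(mod2, mod1)  # F-test of reduced vs. full model
```

With this simulated data none of the effects are real, so you would expect the F-test to be non-significant as well.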
CodePudding user response:
You can think about this in terms of "contrasts" in the context of a linear regression model with factor predictors. Then, you can use the marginaleffects package to estimate the average "contrast", or what you call the "contribution", of each variable. (Disclaimer: I am the author.)
Here is a detailed vignette on the concept of contrasts: https://vincentarelbundock.github.io/marginaleffects/articles/mfx02_contrasts.html
library(marginaleffects)
set.seed(123)
var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T)
score = rnorm(10000,10,5)
my_data = data.frame(var1,var2, var3,var4, score)
for (v in colnames(my_data)[1:4]) {
my_data[[v]] = factor(my_data[[v]])
}
model = lm(score ~ ., data = my_data)
cmp = comparisons(model)
summary(cmp)
#> Average contrasts
#> Term Contrast Effect Std. Error z value Pr(>|z|) 2.5 %
#> 1 var1 1 - 0 0.032344 0.1001 0.32320 0.74654 -0.1638
#> 2 var2 1 - 0 0.092966 0.1001 0.92896 0.35291 -0.1032
#> 3 var3 1 - 0 -0.004377 0.1001 -0.04373 0.96512 -0.2005
#> 4 var4 1 - 0 0.051395 0.1001 0.51360 0.60753 -0.1447
#> 97.5 %
#> 1 0.2285
#> 2 0.2891
#> 3 0.1918
#> 4 0.2475
#>
#> Model type: lm
#> Prediction type: response
Created on 2022-06-07 by the reprex package (v2.0.1)