Comparing the Effect of a Variable Being Absent/Present?

Time:06-08

I am working with the R programming language.

I have the following data:

set.seed(123)

var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T)
score = rnorm(10000,10,5)

my_data = data.frame(var1,var2, var3,var4, score)

We can see the summary of unique rows for this data with the following command:

# https://stackoverflow.com/questions/34312324/r-count-all-combinations
> library(data.table)
> dt = my_data[, c(1, 2, 3, 4)]
> setDT(dt)[, list(Count = .N), names(dt)]
    var1 var2 var3 var4 Count
 1:    0    0    0    0   667
 2:    0    1    0    0   601
 3:    1    1    1    1   651
 4:    0    1    1    1   608
 5:    1    0    1    1   613
 6:    1    1    0    1   588
 7:    0    1    1    0   607
 8:    0    0    1    1   607
 9:    1    0    1    0   625
10:    0    1    0    1   661
11:    1    1    1    0   635
12:    0    0    1    0   640
13:    1    1    0    0   608
14:    1    0    0    0   607
15:    0    0    0    1   626
16:    1    0    0    1   656

I want to compare the average value of "score" when a given variable is "present" (1) versus "absent" (0), with all other variables held fixed. For example:

  • Contribution for Var4: Average score for (var1 = 1, var2 = 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2 = 1, var3 = 1, var4 = 0)

  • Contribution for Var2: Average score for (var1 = 1, var2 = 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2 = 0, var3 = 1, var4 = 1)

  • etc.

I found a very "clumsy" way to do this:

var1_present <- my_data[which(my_data$var1 == 1 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]

var1_present_score = mean(var1_present$score)

var1_absent <- my_data[which(my_data$var1 == 0 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]

var1_absent_score = mean(var1_absent$score)

var_1_contribution = var1_present_score - var1_absent_score

var_1_contribution
[1] 0.1288283

Is there some way to write a function that evaluates the "contribution" of different variables to the "score"? I understand that even with just 4 variables there are many possible comparisons - e.g. row 14 vs. row 16: (1,0,0,0) vs. (1,0,0,1). But even for just some "contributions", is it possible to write a function that evaluates the "contribution" of a variable being absent/present?

Can someone please show me how to do this?

Thanks!

CodePudding user response:

If those "contributions" only compare a variable going from 1 to 0 while all the other variables stay at 1, you may try

library(dplyr)

my_data %>%
  mutate(key = var1 + var2 + var3 + var4) %>%
  summarize(across(var1:var4, ~ mean(score[.x == 1 & key == 4]) - mean(score[.x == 0 & key == 3])))

       var1       var2      var3        var4
1 0.1288283 0.07935381 0.1098573 -0.09315946
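For readers not using dplyr, the same calculation can be written in base R (a sketch of the identical logic):

```r
# Base-R sketch: for each variable, compare the mean score where all four
# variables are 1 against the mean score where only that variable is 0
# and the other three are 1.
vars <- c("var1", "var2", "var3", "var4")
key  <- rowSums(my_data[, vars])
sapply(vars, function(v) {
  present <- my_data$score[my_data[[v]] == 1 & key == 4]
  absent  <- my_data$score[my_data[[v]] == 0 & key == 3]
  mean(present) - mean(absent)
})
```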

CodePudding user response:

You appear to have a 2^k factorial design experiment, so named because each variable is either "on" or "off" creating 2^4 = 16 conditions. You can test for main effects and interaction effects of combinations of variables with lm.

mod1 <- lm(score ~ var1*var2*var3*var4, data = my_data)
summary(mod1)
Call:
lm(formula = score ~ var1 * var2 * var3 * var4, data = my_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.4738  -3.3164   0.0109   3.3905  18.2905 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          9.99438    0.20041  49.870   <2e-16 ***
var1                 0.16979    0.27420   0.619    0.536    
var2                 0.28658    0.28495   1.006    0.315    
var3                 0.15310    0.28074   0.545    0.586    
var4                -0.14692    0.28365  -0.518    0.604    
var1:var2           -0.02208    0.39440  -0.056    0.955    
var1:var3           -0.37453    0.39302  -0.953    0.341    
var2:var3           -0.31516    0.40026  -0.787    0.431    
var1:var4           -0.24531    0.39354  -0.623    0.533    
var2:var4            0.14085    0.40357   0.349    0.727    
var3:var4            0.44781    0.39681   1.129    0.259    
var1:var2:var3      -0.02015    0.56110  -0.036    0.971    
var1:var2:var4      -0.33879    0.56254  -0.602    0.547    
var1:var3:var4       0.07728    0.56006   0.138    0.890    
var2:var3:var4      -0.83680    0.56508  -1.481    0.139    
var1:var2:var3:var4  1.04485    0.79458   1.315    0.189    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.962 on 9984 degrees of freedom
Multiple R-squared:  0.001903,  Adjusted R-squared:  0.0004037 
F-statistic: 1.269 on 15 and 9984 DF,  p-value: 0.2123

The coefficients represent the contribution of each variable. Typically, you would drop the insignificant terms and refit the model with the remaining variables to assess their effects.
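For instance, a reduced main-effects model could be fit and compared to the full model with a partial F-test (the term selection here is purely illustrative, since nothing reaches significance in this simulated data):

```r
# Refit with main effects only, then test whether dropping all the
# interaction terms significantly worsens the fit
mod2 <- lm(score ~ var1 + var2 + var3 + var4, data = my_data)
summary(mod2)
anova(mod2, mod1)  # partial F-test: reduced vs. full model
```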

CodePudding user response:

You can think about this in terms of “contrasts” in the context of a linear regression model with factor predictors. Then, you can use the marginaleffects package to estimate the average “contrast” of each variable - what you call its “contribution”. (Disclaimer: I am the author.)

Here is a detailed vignette on the concept of contrasts: https://vincentarelbundock.github.io/marginaleffects/articles/mfx02_contrasts.html

library(marginaleffects)
set.seed(123)

var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T)
score = rnorm(10000,10,5)

my_data = data.frame(var1,var2, var3,var4, score)

for (v in colnames(my_data)[1:4]) {
    my_data[[v]] = factor(my_data[[v]])
}

model = lm(score ~ ., data = my_data)

cmp = comparisons(model)
summary(cmp)
#> Average contrasts 
#>   Term Contrast    Effect Std. Error  z value Pr(>|z|)   2.5 %
#> 1 var1    1 - 0  0.032344     0.1001  0.32320  0.74654 -0.1638
#> 2 var2    1 - 0  0.092966     0.1001  0.92896  0.35291 -0.1032
#> 3 var3    1 - 0 -0.004377     0.1001 -0.04373  0.96512 -0.2005
#> 4 var4    1 - 0  0.051395     0.1001  0.51360  0.60753 -0.1447
#>   97.5 %
#> 1 0.2285
#> 2 0.2891
#> 3 0.1918
#> 4 0.2475
#> 
#> Model type:  lm 
#> Prediction type:  response

Created on 2022-06-07 by the reprex package (v2.0.1)
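To also compare two arbitrary on/off patterns, such as the (1,0,0,0) vs. (1,0,0,1) comparison mentioned in the question, a small helper along these lines may work (`contribution` is a hypothetical name, not part of any package):

```r
# Hypothetical helper: average score difference between two specific
# 0/1 patterns of the four variables
contribution <- function(data, pattern_a, pattern_b, vars = paste0("var", 1:4)) {
  match_rows <- function(p) {
    # logical index of rows matching the given 0/1 pattern exactly
    Reduce(`&`, Map(function(v, val) data[[v]] == val, vars, p))
  }
  mean(data$score[match_rows(pattern_a)]) - mean(data$score[match_rows(pattern_b)])
}

contribution(my_data, c(1, 0, 0, 1), c(1, 0, 0, 0))  # effect of var4 when only var1 is on
```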
