Multiple Linear Regression with Categorical Independent Variables (with subset function)

I am a beginner in R and I am trying to build my first multiple linear regression model. I want to know whether the dependent variable meanf0 differs across VowelPosition (an independent variable) in disyllabic words only (syllable count is recorded in the SyllableCount variable, which has two levels: disyllabic and trisyllabic) and only for the SyllabicType "open" (SyllabicType is another predictor with two levels: "open" and "closed"). I am stuck on how to build a model using just some categories of a given independent variable, if that is possible. Here is my tentative model:

model_F0_disyll <- lm (data=QP1_subset_norm,
                       meanf0_norm~SyllableCount + syllableType + VowelPosition,
                       subset(SyllableCount=="2" & syllableType=="open"))

but it does not seem to work. Thank you in advance for your guidance!

CodePudding user response：

I think you would need something like:

model_F0_disyll <- lm (data=QP1_subset_norm,
                       meanf0_norm~SyllableCount + syllableType + VowelPosition,
                       subset = (SyllableCount==2 & syllableType=="open"))

The subset argument takes just the logical expression that defines the subset; you don't need to call the subset() function. Further, when the variable is numeric (presumably like SyllableCount), you can compare it against either a numeric or a string value, because R coerces the number to a string for the comparison. That is, SyllableCount == "2" and SyllableCount == 2 both work.

Here's an example with the mtcars data:

mod <- lm(mpg ~ hp + wt, data=mtcars, subset=(am == "1" & cyl == 4))
summary(mod)
#> 
#> Call:
#> lm(formula = mpg ~ hp + wt, data = mtcars, subset = (am == "1" & 
#>     cyl == 4))
#> 
#> Residuals:
#>     Datsun 710       Fiat 128    Honda Civic Toyota Corolla      Fiat X1-9 
#>       -2.66851        4.18787       -2.61455        3.25523       -2.62538 
#>  Porsche 914-2   Lotus Europa     Volvo 142E 
#>       -0.77799        1.17181        0.07154 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 47.24552    6.57304   7.188 0.000811 ***
#> hp          -0.07288    0.05695  -1.280 0.256814    
#> wt          -6.46508    3.15205  -2.051 0.095512 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.193 on 5 degrees of freedom
#> Multiple R-squared:  0.6378, Adjusted R-squared:  0.493 
#> F-statistic: 4.403 on 2 and 5 DF,  p-value: 0.07893

Created on 2022-06-16 by the reprex package (v2.0.1)
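To make the difference between the subset argument and the subset() function concrete, here is a small sketch on mtcars. It fits the same model two ways, once with lm()'s subset argument and once on a data frame filtered beforehand with subset(), and also checks the point about comparing a numeric column to "1" versus 1. The object names (fit_arg, fit_pre, small) are just illustrative.

```r
# (a) pass a logical expression to lm()'s subset argument
fit_arg <- lm(mpg ~ hp + wt, data = mtcars, subset = am == 1 & cyl == 4)

# (b) filter the data frame first with subset(), then fit on the result
small <- subset(mtcars, am == 1 & cyl == 4)
fit_pre <- lm(mpg ~ hp + wt, data = small)

# Both fits use the same rows and give the same coefficients
nobs(fit_arg)
all.equal(coef(fit_arg), coef(fit_pre))

# Comparing a numeric column to "1" or to 1 selects the same rows,
# because R coerces the number to a character string for ==
identical(mtcars$am == "1", mtcars$am == 1)
```

Either route should be fine; the subset argument simply saves creating an intermediate data frame.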

CodePudding user response：

I tried the code based on your suggestion but I got an error, which says:

```r
model_F0_disyll <- lm (meanf0_norm~SyllableCount + VowelPosition,data=QP1_subset_norm_1, subset=(SyllableCount=="2"))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
```
And then when I ran summary(), I got the output below, where the subset seems to have been ignored by R:
```r
Call:
lm(formula = meanf0_norm ~ VowelPosition + SyllableCount, data = QP1_subset_norm)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4119 -0.6489 -0.1001  0.6089 11.6656 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.46669    0.01973  23.659   <2e-16 ***
VowelPositionpen -0.40440    0.03285 -12.309   <2e-16 ***
VowelPositionfi  -0.99317    0.02347 -42.323   <2e-16 ***
SyllableCount3   -0.05940    0.02333  -2.546   0.0109 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8925 on 7033 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.2032,    Adjusted R-squared:  0.2029 
F-statistic: 597.9 on 3 and 7033 DF,  p-value: < 2.2e-16
```

<sup>Created on 2022-06-16 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
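
A note on the contrasts error above: R reports it when a factor (or character) predictor in the formula has only one level left in the rows being fitted. Subsetting to SyllableCount == "2" while also keeping SyllableCount as a predictor produces exactly that situation, since the variable is then constant. And because the lm() call stopped with an error, model_F0_disyll was presumably never reassigned, which would explain why summary() showed an earlier fit without the subset. The sketch below reproduces the problem on mtcars and shows one way around it (dropping the constant predictor); the cars data frame and the choice of cyl and am are stand-ins for illustration, not the original data.

```r
# Treat cyl and am as categorical, similar to SyllableCount / syllableType
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Keeping only cyl == "4" leaves cyl with a single level in the fitted rows,
# so using it as a predictor fails with the contrasts error
res <- try(lm(mpg ~ cyl + am, data = cars, subset = cyl == "4"), silent = TRUE)
inherits(res, "try-error")
conditionMessage(attr(res, "condition"))

# Dropping the now-constant predictor from the formula lets the fit go through
fit <- lm(mpg ~ am, data = cars, subset = cyl == "4")
nobs(fit)
coef(fit)
```

By the same logic, and untested on the original data, the model in the question would probably become lm(meanf0_norm ~ VowelPosition, data = QP1_subset_norm, subset = SyllableCount == "2" & syllableType == "open"), with both subsetting variables left out of the formula.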