Why is my summary in R only including some of my variables?

I am trying to see if there is a relationship between the number of bat calls and the time of pup rearing season. The pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only includes the p-values for Pre and Post pup production. I created a sample data set below. With the sample data set, I just get an error; with my actual data set I get the output I described above.

SAMPLE DATA SET:

 Calls<- c("55","60","180","160","110","50") 
 Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
 q<-data.frame(Calls, Pup)
 q
 q1<-lm(Calls~Pup, data=q)
 summary(q1)

OUTPUT AND ERROR MESSAGE FROM SAMPLE:

> Calls    Pup
1    55    Pre
2    60 Middle
3   180   Post
4   160   Post
5   110 Middle
6    50    Pre

Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors

ACTUAL INPUT FOR MY ANALYSIS:

> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)

THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:

Residuals:
   Min     1Q Median     3Q    Max 
-66.40 -37.63 -26.02  -5.39 299.93 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    66.54      35.82   1.858   0.0734 .
PupPost       -51.98      48.50  -1.072   0.2927  
PupPre        -26.47      39.86  -0.664   0.5118  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared:  0.03822,   Adjusted R-squared:  -0.02811 
F-statistic: 0.5762 on 2 and 29 DF,  p-value: 0.5683

Overall, just wondering why the above output isn't showing "Middle". Sorry my sample data set didn't work out the same but maybe that error message will help better understand the problem.

CodePudding user response：

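R turns a qualitative (categorical) variable such as Pup into a set of dummy variables; declaring it explicitly with factor() makes that unambiguous. The error in your sample, however, comes from Calls: the values are quoted, so the response is not numeric. Convert it before fitting:

> Calls <- as.numeric(Calls)
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)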

Call:
lm(formula = Calls ~ Pup, data = q)

Residuals:
    1     2     3     4     5     6 
  2.5 -25.0  10.0 -10.0  25.0  -2.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    85.00      15.61   5.444   0.0122 *
PupPost        85.00      22.08   3.850   0.0309 *
PupPre        -32.50      22.08  -1.472   0.2374  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared:  0.9097,    Adjusted R-squared:  0.8494 
F-statistic:  15.1 on 2 and 3 DF,  p-value: 0.02716
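
To see why "Middle" has no row of its own, it can help to look at how R codes the factor. The sketch below rebuilds the sample data (with Calls as numbers) and inspects the levels and the design matrix. The names q_demo and fit_demo are made up for this illustration and are not from the question.

```r
# Sample data from the question, with Calls stored as numbers
Calls <- c(55, 60, 180, 160, 110, 50)
Pup <- factor(c("Pre", "Middle", "Post", "Post", "Middle", "Pre"))
q_demo <- data.frame(Calls, Pup)  # q_demo: hypothetical name for this sketch

# Levels are sorted alphabetically by default; the first one is the baseline
pup_levels <- levels(q_demo$Pup)
print(pup_levels)

# The design matrix has an intercept plus one dummy per non-baseline level
design_cols <- colnames(model.matrix(Calls ~ Pup, data = q_demo))
print(design_cols)

# The intercept equals the mean of the baseline group ("Middle")
fit_demo <- lm(Calls ~ Pup, data = q_demo)
middle_mean <- mean(q_demo$Calls[q_demo$Pup == "Middle"])
print(c(intercept = unname(coef(fit_demo)["(Intercept)"]),
        middle_mean = middle_mean))
```

So "Middle" is not missing from the model: it is the baseline, and PupPost and PupPre are differences from it.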

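R leaves out one category (here "Middle", the first level alphabetically) because it serves as the baseline: the intercept is the mean for Middle, and PupPost and PupPre are differences from it. If you want R to show a coefficient for every category, you must remove the intercept from the regression; keeping both the intercept and all three dummies would put you in the dummy variable trap (perfect collinearity).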

summary(lm(Calls~Pup-1, data=q))

Call:
lm(formula = Calls ~ Pup - 1, data = q)

Residuals:
    1     2     3     4     5     6 
  2.5 -25.0  10.0 -10.0  25.0  -2.5 

Coefficients:
          Estimate Std. Error t value Pr(>|t|)   
PupMiddle    85.00      15.61   5.444  0.01217 * 
PupPost     170.00      15.61  10.889  0.00166 **
PupPre       52.50      15.61   3.363  0.04365 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared:  0.9815,    Adjusted R-squared:  0.9631 
F-statistic: 53.17 on 3 and 3 DF,  p-value: 0.004234
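
As a rough check on what the no-intercept coefficients mean, the sketch below compares them with the plain group means and confirms that both parameterisations give the same fitted values. It assumes the same sample data as above; q_demo, fit_noint, and fit_int are illustrative names only.

```r
# Sample data from the question, with Calls stored as numbers
Calls <- c(55, 60, 180, 160, 110, 50)
Pup <- factor(c("Pre", "Middle", "Post", "Post", "Middle", "Pre"))
q_demo <- data.frame(Calls, Pup)  # q_demo: hypothetical name for this sketch

# Without an intercept, each coefficient is that group's mean
fit_noint <- lm(Calls ~ Pup - 1, data = q_demo)
group_means <- tapply(q_demo$Calls, q_demo$Pup, mean)
print(coef(fit_noint))
print(group_means)

# With or without the intercept, the fitted values are identical
fit_int <- lm(Calls ~ Pup, data = q_demo)
same_fit <- isTRUE(all.equal(fitted(fit_int), fitted(fit_noint)))
print(same_fit)
```

Keep in mind that the p-values in the no-intercept summary test whether each group mean differs from zero, not whether the groups differ from one another, and that R computes R-squared differently when the intercept is dropped, so the two R-squared values are not comparable.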

CodePudding user response：

When you include a categorical variable like Pup in a regression, R by default creates a dummy variable for every level of that variable except one, the reference level. You can get a coefficient for PupMiddle by omitting the intercept instead, like this:

q1<-lm(Calls~Pup - 1, data=q)
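
If you would rather keep the intercept and simply choose which category is the baseline, relevel() is one option. This is a minimal sketch on the sample data, assuming Calls is numeric; q_demo and fit_pre are hypothetical names used only here.

```r
# Sample data from the question, with Calls stored as numbers
Calls <- c(55, 60, 180, 160, 110, 50)
Pup <- factor(c("Pre", "Middle", "Post", "Post", "Middle", "Pre"))
q_demo <- data.frame(Calls, Pup)  # q_demo: hypothetical name for this sketch

# Make "Pre" the baseline instead of the alphabetical default ("Middle")
q_demo$Pup <- relevel(q_demo$Pup, ref = "Pre")

# The summary now reports PupMiddle and PupPost, both relative to "Pre"
fit_pre <- lm(Calls ~ Pup, data = q_demo)
print(coef(fit_pre))
```

With "Pre" as the reference, the Middle and Post rows answer the question "how do calls differ from the pre-pup period?" directly.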