R lm() function won't display all data

I am trying to make a linear regression model using data that I have sorted into new categories. (Specifically, I have taken age from the NHANES database and sorted different age ranges into generations.)

When I attempt to use R's lm() function on my new data, I receive an output that accounts for all but one set of generational data, which I will show and explain below.

library(tidyverse)
library(janitor)
library(NHANES) 
data(NHANES)
View(NHANES)
help(NHANES)

Database <- NHANES %>% 
  select(SleepHrsNight, BMI, AgeDecade, HHIncome, Age) %>%   # select variables of interest
  drop_na() # remove any rows with NA's to leave only complete observations

Database %>%
  ggplot(aes(x = SleepHrsNight,
             y = BMI)) +
  geom_point() +
  labs(x = "Quantity of Sleep (hours)", 
       y = "BMI Level", 
       title = "Quantity of Sleep vs. BMI")

cor(Database$BMI, Database$SleepHrsNight)


view(Database)
################### THIS IS THE CODE THAT SORTS MY AGE DATA INTO GENERATIONS
Database$AgeGeneration <- ifelse(Database$Age >= 10 & Database$Age <= 25,"Gen Z",
                                 ifelse(Database$Age >= 26 & Database$Age <=41, "Millenials",
                                        ifelse(Database$Age >= 42 & Database$Age <= 57, "Gen X",
                                               ifelse( Database$Age > 57, "Baby Boomers",0))))


BMI_SleepHrsNight_AgeGeneration_model = Database %>%
  lm(BMI ~ SleepHrsNight + AgeGeneration, data = .)
summary(BMI_SleepHrsNight_AgeGeneration_model)

Regression_model <- Database %>%
  lm(BMI ~ SleepHrsNight + AgeGeneration, .)
summary(Regression_model)

THIS IS THE OUTPUT

Call:
lm(formula = BMI ~ SleepHrsNight + AgeGeneration, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.389  -4.616  -1.251   3.479  53.592 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             30.35296    0.46202  65.697   <2e-16 ***
SleepHrsNight           -0.13893    0.06169  -2.252   0.0244 *  
AgeGenerationGen X      -0.29029    0.22710  -1.278   0.2012    
AgeGenerationGen Z      -2.78956    0.25964 -10.744   <2e-16 ***
AgeGenerationMillenials -0.38842    0.22671  -1.713   0.0867 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.706 on 6726 degrees of freedom
Multiple R-squared:  0.02216,   Adjusted R-squared:  0.02158 
F-statistic: 38.11 on 4 and 6726 DF,  p-value: < 2.2e-16

The output above is missing data from the "Baby Boomers" group and I have no idea why. When I view the database, the Baby Boomer data shows up, but for some reason it seems not to exist when I summarize the lm() fit. I also used this method on a set of data that was made in an identical way and I received the same issue. I am very new to R and statistics, so I am not familiar enough with the language to figure this out. Any help would be appreciated, thank you.

CodePudding user response：

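I am not sure if this answers your question; I feel it is perhaps more about statistics than about code. Baby Boomers is actually there, as the reference group: lm() converts the character column to a factor with alphabetically sorted levels, so "Baby Boomers" comes first and is absorbed into the intercept, and the other AgeGeneration coefficients are differences from it.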
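To see the reference-group behaviour in isolation, here is a minimal sketch on made-up data. The `toy` data frame and its values are assumptions for illustration only, not NHANES.

```r
# Synthetic data for illustration only (not NHANES)
set.seed(1)
toy <- data.frame(
  y = rnorm(40),
  group = rep(c("Gen Z", "Millenials", "Gen X", "Baby Boomers"), each = 10)
)

fit <- lm(y ~ group, data = toy)

# A character predictor is treated as a factor with sorted levels,
# so "Baby Boomers" is the first level and gets no coefficient of its own
levels(factor(toy$group))
names(coef(fit))

# The intercept is the mean of the reference group
ref_mean <- mean(toy$y[toy$group == "Baby Boomers"])
all.equal(unname(coef(fit)["(Intercept)"]), ref_mean)
```

The coefficient names contain Gen X, Gen Z and Millenials but not Baby Boomers, which mirrors the summary in the question.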

However, some of your other code can be improved:

For generating AgeGeneration, I would use:

Database$AgeGeneration = cut(Database$Age, breaks = c(15, 25, 41, 57, 80), 
                            labels = c("Gen Z", "Millenials", "Gen X", "Baby Boomers"))
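As a quick check of how these breaks map ages to labels, here is a small sketch on a hand-picked vector of boundary ages (the `ages` vector is an assumption for illustration). By default cut() uses right-closed intervals, so the first bin is (15, 25].

```r
# Hand-picked boundary ages, for illustration only
ages <- c(16, 25, 26, 41, 42, 57, 58, 80)

gen <- cut(ages, breaks = c(15, 25, 41, 57, 80),
           labels = c("Gen Z", "Millenials", "Gen X", "Baby Boomers"))

# Show each age next to its assigned generation
data.frame(ages, gen)

# Levels follow the order of `labels`, so "Gen Z" is the first level
levels(gen)

# Values outside (15, 80] are not assigned a label and become NA
outside <- cut(c(15, 81), breaks = c(15, 25, 41, 57, 80),
               labels = c("Gen Z", "Millenials", "Gen X", "Baby Boomers"))
is.na(outside)
```

One design point worth noting: unlike the nested ifelse() version, this returns a factor rather than a character vector, so the level order (and therefore the reference group in lm()) is under your control.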

Because cut() orders the factor levels as given in labels, Gen Z is now the first level and therefore the reference group. If you want to pick the reference group explicitly, use relevel() on the factor (cut() itself has no levels argument):

Database$AgeGeneration = relevel(Database$AgeGeneration, ref = "Gen Z")

With Gen Z as reference in this case, you will see the coefficient for Baby Boomers.

Regression_model <- Database %>%
    lm(BMI ~ SleepHrsNight + AgeGeneration, .)

summary(Regression_model)
Call:
lm(formula = BMI ~ SleepHrsNight + AgeGeneration, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.389  -4.616  -1.251   3.479  53.592 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                27.56341    0.48416  56.930   <2e-16 ***
SleepHrsNight              -0.13893    0.06169  -2.252   0.0244 *  
AgeGenerationMillenials    2.40114    0.24732   9.709   <2e-16 ***
AgeGenerationGen X         2.49927    0.24808  10.074   <2e-16 ***
AgeGenerationBaby Boomers  2.78956    0.25964  10.744   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.706 on 6726 degrees of freedom
Multiple R-squared:  0.02216,   Adjusted R-squared:  0.02158 
F-statistic: 38.11 on 4 and 6726 DF,  p-value: < 2.2e-16

CodePudding user response：

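If you want the Baby Boomers category to appear as its own coefficient, you can remove the intercept from the regression. Each AgeGeneration coefficient is then that group's own intercept rather than a difference from a reference group.

lm(BMI ~ -1 + SleepHrsNight + AgeGeneration, data = .)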
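A minimal sketch of what dropping the intercept does, on made-up data (the `toy` data frame, its two groups, and the simulated effect sizes are all assumptions for illustration):

```r
# Synthetic data for illustration only (not NHANES)
set.seed(2)
toy <- data.frame(
  x = rnorm(40),
  group = factor(rep(c("A", "B"), each = 20))
)
toy$y <- 1 + 0.5 * toy$x + ifelse(toy$group == "B", 2, 0) + rnorm(40)

with_int <- lm(y ~ x + group, data = toy)       # group A is the reference
no_int   <- lm(y ~ -1 + x + group, data = toy)  # every group gets a coefficient

names(coef(with_int))
names(coef(no_int))

# Without an intercept, groupB is the level of group B itself,
# i.e. the old intercept plus the old groupB difference
b_level <- unname(coef(with_int)["(Intercept)"] + coef(with_int)["groupB"])
all.equal(unname(coef(no_int)["groupB"]), b_level)
```

Both fits describe the same model, only parameterised differently, so the t-tests on the group coefficients answer different questions: "differs from the reference group" with the intercept, "differs from zero" without it.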