I am trying to make a linear regression model using data that I have sorted into new categories. (Specifically, I have taken age from the NHANES database and sorted different age ranges into generations.)
When I attempt to use R's lm() function on my new data, I receive an output that accounts for all but one set of generational data, which I will show and explain below.
library(tidyverse)
library(janitor)
library(NHANES)
data(NHANES)
View(NHANES)
help(NHANES)
Database <- NHANES %>%
select(SleepHrsNight, BMI, AgeDecade, HHIncome, Age) %>% # select variables of interest
drop_na() # remove any rows with NA's to leave only complete observations
Database %>%
  ggplot(aes(x = SleepHrsNight,
             y = BMI)) +
  geom_point() +
  labs(x = "Quantity of Sleep (hours)",
       y = "BMI Level",
       title = "Quantity of Sleep vs. BMI")
cor(Database$BMI, Database$SleepHrsNight)
view(Database)
################### THIS IS THE CODE THAT SORTS MY AGE DATA INTO GENERATIONS
Database$AgeGeneration <- ifelse(Database$Age >= 10 & Database$Age <= 25,"Gen Z",
ifelse(Database$Age >= 26 & Database$Age <=41, "Millenials",
ifelse(Database$Age >= 42 & Database$Age <= 57, "Gen X",
ifelse( Database$Age > 57, "Baby Boomers",0))))
BMI_SleepHrsNight_AgeGeneration_model = Database %>%
  lm(BMI ~ SleepHrsNight + AgeGeneration, data = .)
summary(BMI_SleepHrsNight_AgeGeneration_model)
Regression_model <- Database %>%
  lm(BMI ~ SleepHrsNight + AgeGeneration, data = .)
summary(Regression_model)
THIS IS THE OUTPUT
Call:
lm(formula = BMI ~ SleepHrsNight + AgeGeneration, data = .)
Residuals:
    Min      1Q  Median      3Q     Max
-14.389  -4.616  -1.251   3.479  53.592
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             30.35296    0.46202  65.697   <2e-16 ***
SleepHrsNight           -0.13893    0.06169  -2.252   0.0244 *
AgeGenerationGen X      -0.29029    0.22710  -1.278   0.2012
AgeGenerationGen Z      -2.78956    0.25964 -10.744   <2e-16 ***
AgeGenerationMillenials -0.38842    0.22671  -1.713   0.0867 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.706 on 6726 degrees of freedom
Multiple R-squared: 0.02216, Adjusted R-squared: 0.02158
F-statistic: 38.11 on 4 and 6726 DF, p-value: < 2.2e-16
The output above is missing the "Baby Boomers" group and I have no idea why. When I view the database the Baby Boomer data shows up, but for some reason it seems not to exist when I summarize the lm() fit. I also used this method on a set of data that was made in an identical way and I received the same issue. I am very new to R and statistics, so I am not familiar enough with the language to figure this out. Any help would be appreciated, thank you.
CodePudding user response:
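I am not sure if this answers your question, since I feel it is perhaps more about statistics than about code. Baby Boomers was actually there all along, as the reference group. lm() turns a character column into a factor whose levels are sorted alphabetically, and the first level ("Baby Boomers" here) is absorbed into the intercept. The other AgeGeneration coefficients are differences from that group, so no data was dropped.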
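A small sketch on made-up data may help to see the reference level in action. The data frame `toy` and its values are hypothetical (not NHANES); only base R is used.

```r
# Hypothetical toy data, not NHANES
toy <- data.frame(
  bmi = c(30, 31, 27, 26, 29, 28, 32, 25),
  gen = c("Baby Boomers", "Baby Boomers", "Gen Z", "Gen Z",
          "Millenials", "Gen X", "Gen X", "Millenials"),
  stringsAsFactors = FALSE
)

# A character vector becomes a factor with alphabetically sorted levels
gen_levels <- levels(factor(toy$gen))
print(gen_levels)

# The first level gets no coefficient of its own in the fit
fit <- lm(bmi ~ gen, data = toy)
coef_names <- names(coef(fit))
print(coef_names)

# With no other predictors, the intercept is the mean of the reference group
boomer_mean <- mean(toy$bmi[toy$gen == "Baby Boomers"])
print(all.equal(unname(coef(fit)["(Intercept)"]), boomer_mean))
```

In the question's model there is also SleepHrsNight, so the intercept there is the expected BMI of a Baby Boomer at zero hours of sleep rather than a plain group mean.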
However, some of your other code can be improved:
For generating AgeGeneration, I would use:
Database$AgeGeneration = cut(Database$Age, breaks = c(15, 25, 41, 57, 80),
labels = c("Gen Z", "Millenials", "Gen X", "Baby Boomers"))
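To check how cut() assigns the boundary ages, here is a quick sketch with hand-picked values. The vector `ages` is hypothetical and chosen to sit on the interval edges; the breaks and labels are the ones from the call above.

```r
# Hypothetical ages chosen to sit on the interval edges
ages <- c(16, 25, 26, 41, 42, 57, 58, 80)

generation <- cut(ages, breaks = c(15, 25, 41, 57, 80),
                  labels = c("Gen Z", "Millenials", "Gen X", "Baby Boomers"))

# Intervals are right-closed by default: (15,25], (25,41], (41,57], (57,80]
print(data.frame(ages, generation))

# The result is a factor whose levels follow the order of `labels`
print(levels(generation))

# Values outside the breaks (here 15 and 81) become NA
outside <- cut(c(15, 81), breaks = c(15, 25, 41, 57, 80))
print(outside)
```

Note that the ifelse() version in the question starts at age 10, while these breaks start just above 15, so it is worth checking range(Database$Age) to confirm that no rows would turn into NA.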
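If you want to change the reference group to show the coefficient for Baby Boomers, you just need a different group to come first in the levels of AgeGeneration. Note that cut() has no levels argument (an extra levels = ... is silently ignored); the levels follow the order of labels, and relevel() moves a chosen level to the front:
Database$AgeGeneration = cut(Database$Age, breaks = c(15, 25, 41, 57, 80),
labels = c("Gen Z", "Millenials", "Gen X", "Baby Boomers"))
Database$AgeGeneration = relevel(Database$AgeGeneration, ref = "Gen Z") # Gen Z is already first here; this just makes the reference explicit
With Gen Z as reference in this case, you will see the coefficient for Baby Boomers.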
Regression_model <- Database %>%
  lm(BMI ~ SleepHrsNight + AgeGeneration, data = .)
summary(Regression_model)
Call:
lm(formula = BMI ~ SleepHrsNight + AgeGeneration, data = .)
Residuals:
    Min      1Q  Median      3Q     Max
-14.389  -4.616  -1.251   3.479  53.592
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)               27.56341    0.48416  56.930   <2e-16 ***
SleepHrsNight             -0.13893    0.06169  -2.252   0.0244 *
AgeGenerationMillenials    2.40114    0.24732   9.709   <2e-16 ***
AgeGenerationGen X         2.49927    0.24808  10.074   <2e-16 ***
AgeGenerationBaby Boomers  2.78956    0.25964  10.744   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.706 on 6726 degrees of freedom
Multiple R-squared: 0.02216, Adjusted R-squared: 0.02158
F-statistic: 38.11 on 4 and 6726 DF, p-value: < 2.2e-16
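Notice that the Baby Boomers estimate here (2.78956) is exactly the negative of the Gen Z estimate in the question's output (-2.78956): it is the same model, only expressed relative to a different baseline. A minimal sketch of that relationship, on hypothetical toy data (the names `toy`, `gen2`, `fit_boomers` and `fit_genz` are made up for illustration):

```r
# Hypothetical toy data, not NHANES
toy <- data.frame(
  bmi = c(30, 31, 27, 26, 29, 28, 32, 25),
  gen = factor(c("Baby Boomers", "Baby Boomers", "Gen Z", "Gen Z",
                 "Millenials", "Gen X", "Gen X", "Millenials"))
)

# Default (alphabetical) levels: "Baby Boomers" is the reference
fit_boomers <- lm(bmi ~ gen, data = toy)

# Move "Gen Z" to the front so it becomes the reference instead
toy$gen2 <- relevel(toy$gen, ref = "Gen Z")
fit_genz <- lm(bmi ~ gen2, data = toy)

# Gen Z relative to Boomers vs. Boomers relative to Gen Z
b_genz   <- unname(coef(fit_boomers)["genGen Z"])
b_boomer <- unname(coef(fit_genz)["gen2Baby Boomers"])
print(c(b_genz, b_boomer))

# The fitted values do not depend on the choice of reference
same_fit <- all.equal(fitted(fit_boomers), fitted(fit_genz))
print(same_fit)
```

So the choice of reference changes how the coefficients read, not the fit itself, which is also why the residuals and R-squared are identical in both outputs.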
CodePudding user response:
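If you want the Baby Boomers category to appear, then you should remove the intercept from the regression. Each AgeGeneration coefficient is then that group's own intercept rather than a difference from a reference group.
lm(BMI ~ -1 + SleepHrsNight + AgeGeneration, data = .)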
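As a rough illustration of what dropping the intercept does, here is a sketch on hypothetical toy data (the data frame `toy` and the columns `bmi`, `sleep` and `gen` are made up, not NHANES):

```r
# Hypothetical toy data, not NHANES
toy <- data.frame(
  bmi   = c(30, 31, 27, 26, 29, 28, 32, 25),
  sleep = c(6, 7, 8, 9, 7, 6, 5, 8),
  gen   = factor(c("Baby Boomers", "Baby Boomers", "Gen Z", "Gen Z",
                   "Millenials", "Gen X", "Gen X", "Millenials"))
)

fit_with    <- lm(bmi ~ sleep + gen, data = toy)       # with intercept
fit_without <- lm(bmi ~ -1 + sleep + gen, data = toy)  # intercept removed

# Without an intercept, every level of gen gets its own coefficient
print(names(coef(fit_without)))

# The Baby Boomers coefficient is the old intercept under a new name
b_boomers <- unname(coef(fit_without)["genBaby Boomers"])
b0        <- unname(coef(fit_with)["(Intercept)"])
print(c(b_boomers, b0))

# Both parameterisations give the same fitted values
same_fit <- all.equal(fitted(fit_with), fitted(fit_without))
print(same_fit)
```

One caveat: for a model without an intercept, summary() computes R-squared against zero rather than against the mean, so that number is not comparable with the one from the model that keeps the intercept.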