I have a dataframe with Sex (Female=1, Men=0), Race (white=1, non-white=0), among other columns. There are some missing values in both Sex and Race (both are factor variables). Below is a screenshot of the Sex variable distribution.
However, when I ran the linear regression, no missing values are dropped. Below is the regression output. As you can see, for some reason, both 0 and 1 show up for Sex and race. Does that mean R takes "NA" as the baseline? How can I fix the code so that lm() only takes in complete cases?
CodePudding user response:
I'm guessing that your "not available" data are coded as empty strings (""
) rather than as NA
values. R removes only NA
values automatically. You could try
mydata$Sex[mydata$Sex == ""] <- NA
or
mydata$Sex <- factor(mydata$Sex, levels = c(0,1))
and try again ...
CodePudding user response:
you can remove all the rows with NAs with complete.cases
:
all_nodes_group_merged.adj = all_nodes_group_merged[complete.cases(all_nodes_group_merged), ]
By the way I recommend to wrap factor vars as numeric:
lm(formula = Life_Satisfaction_6bp ~ as.numeric(Sex) as.numeric(race_white) item_count, data = all_nodes_group_merged.adj)
Factor vars in regression works in a special way, see : https://stackoverflow.com/a/30159530/11180223
Edit
You can also convert it to numeric and try if it makes some sense:
all_nodes_group_merged.adj$Sex_num = as.numeric(levels(all_nodes_group_merged.adj$Sex))[all_nodes_group_merged.adj$Sex]
all_nodes_group_merged.adj$race_white_num = as.numeric(levels(all_nodes_group_merged.adj$race_white))[all_nodes_group_merged.adj$race_white]
lm(formula = Life_Satisfaction_6bp ~ Sex_num race_white_num item_count, data = all_nodes_group_merged.adj)