NA value not excluded in lm() in R-CodePudding

I have a dataframe with Sex (Female=1, Men=0), Race (white=1, non-white=0), among other columns. There are some missing values in both Sex and Race (both are factor variables). Below is a screenshot of the Sex variable distribution.

However, when I ran the linear regression, no missing values are dropped. Below is the regression output. As you can see, for some reason, both 0 and 1 show up for Sex and race. Does that mean R takes "NA" as the baseline? How can I fix the code so that lm() only takes in complete cases?

CodePudding user response：

I'm guessing that your "not available" data are coded as empty strings ("") rather than as NA values. R removes only NA values automatically. You could try

mydata$Sex[mydata$Sex == ""] <- NA

mydata$Sex <- factor(mydata$Sex, levels = c(0,1))

and try again ...

CodePudding user response：

you can remove all the rows with NAs with complete.cases:

all_nodes_group_merged.adj = all_nodes_group_merged[complete.cases(all_nodes_group_merged), ]

By the way I recommend to wrap factor vars as numeric:

lm(formula = Life_Satisfaction_6bp ~ as.numeric(Sex)   as.numeric(race_white)   item_count, data = all_nodes_group_merged.adj)

Factor vars in regression works in a special way, see : https://stackoverflow.com/a/30159530/11180223

Edit

You can also convert it to numeric and try if it makes some sense:

all_nodes_group_merged.adj$Sex_num = as.numeric(levels(all_nodes_group_merged.adj$Sex))[all_nodes_group_merged.adj$Sex]
all_nodes_group_merged.adj$race_white_num = as.numeric(levels(all_nodes_group_merged.adj$race_white))[all_nodes_group_merged.adj$race_white]

lm(formula = Life_Satisfaction_6bp ~ Sex_num   race_white_num   item_count, data = all_nodes_group_merged.adj)