NA value not excluded in lm() in R

Time:03-25

I have a data frame with Sex (Female=1, Male=0) and Race (white=1, non-white=0), among other columns. Both are factor variables, and both contain some missing values. Below is a screenshot of the Sex variable distribution.

[Screenshot: distribution of the Sex variable]

However, when I ran the linear regression, no missing values were dropped. Below is the regression output. As you can see, for some reason coefficients for both level 0 and level 1 show up for Sex and for race. Does that mean R takes "NA" as the baseline? How can I fix the code so that lm() only uses complete cases?

[Screenshot: regression output]

CodePudding user response:

I'm guessing that your "not available" data are coded as empty strings ("") rather than as NA values. lm() automatically drops only rows that contain NA, so an empty string is treated as just another factor level. You could try

mydata$Sex[mydata$Sex == ""] <- NA

or

mydata$Sex <- factor(mydata$Sex, levels = c(0,1))

and try again ...
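
To illustrate the point above, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical, not the asker's data; the assumption is simply that missing Sex values were stored as "".

```r
# Hypothetical toy data: missing Sex is stored as "" rather than NA
toy <- data.frame(
  y   = c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2, 5.0, 5.8),
  Sex = factor(c("0", "1", "", "1", "0", "", "1", "0"))
)

# "" is an ordinary factor level, so lm() keeps those rows and,
# because "" sorts first, uses it as the baseline level
levels(toy$Sex)
fit_raw <- lm(y ~ Sex, data = toy)
names(coef(fit_raw))   # coefficients for both Sex0 and Sex1
nobs(fit_raw)          # all 8 rows are used

# Recode blanks to NA and drop the now-unused "" level
toy$Sex[toy$Sex == ""] <- NA
toy$Sex <- droplevels(toy$Sex)

# With the default na.action (na.omit), lm() now omits the NA rows
fit_na <- lm(y ~ Sex, data = toy)
names(coef(fit_na))    # only Sex1, with 0 as the baseline
nobs(fit_na)           # 6 rows are used
```

If the output in the question shows both a 0 and a 1 coefficient for a two-level variable, a hidden third level like this is a likely cause, and `levels()` or `table(..., useNA = "ifany")` on the column should reveal it.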

CodePudding user response:

You can remove all rows that contain NAs with complete.cases():

all_nodes_group_merged.adj = all_nodes_group_merged[complete.cases(all_nodes_group_merged), ]
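
As a rough sketch of how this filter behaves (the data frame `toy` and the `notes` column are made up for illustration; only the other column names come from the question), note that filtering on every column can discard rows whose only NA is in a variable the model never uses:

```r
# Hypothetical toy data; `notes` is not part of the model
toy <- data.frame(
  Life_Satisfaction_6bp = c(4, 5, 3, 6, 5, 4),
  Sex        = factor(c("0", "1", NA, "1", "0", "1")),
  item_count = c(10, 12, 11, NA, 9, 13),
  notes      = c("a", NA, "c", "d", "e", "f")
)

# Filtering on every column also drops row 2, whose only NA is in `notes`
nrow(toy[complete.cases(toy), ])

# Restricting the check to the model variables keeps that row
model_vars <- c("Life_Satisfaction_6bp", "Sex", "item_count")
toy_cc <- toy[complete.cases(toy[, model_vars]), ]
nrow(toy_cc)

# lm() applies the same row-wise omission to the model variables itself
fit <- lm(Life_Satisfaction_6bp ~ Sex + item_count, data = toy)
nobs(fit)
```

Here the full filter leaves 3 rows, while the model-variable filter and lm() itself both use 4. Keep in mind that complete.cases() only looks for NA, so values stored as "" are not removed by it.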

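By the way, I recommend wrapping the factor variables in as.numeric() (note that on a factor this returns the integer level codes 1, 2, ..., not the 0/1 labels):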

lm(formula = Life_Satisfaction_6bp ~ as.numeric(Sex) + as.numeric(race_white) + item_count, data = all_nodes_group_merged.adj)

Factor variables are handled in a special way in regression; see: https://stackoverflow.com/a/30159530/11180223

Edit

You can also convert the factors to numeric columns via their level labels and check whether the result makes sense:

all_nodes_group_merged.adj$Sex_num = as.numeric(levels(all_nodes_group_merged.adj$Sex))[all_nodes_group_merged.adj$Sex]
all_nodes_group_merged.adj$race_white_num = as.numeric(levels(all_nodes_group_merged.adj$race_white))[all_nodes_group_merged.adj$race_white]

lm(formula = Life_Satisfaction_6bp ~ Sex_num + race_white_num + item_count, data = all_nodes_group_merged.adj)
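
The difference between calling as.numeric() directly on a factor and converting through its level labels can be checked on a small example. The vector `sex` below is hypothetical and only assumes 0/1 labels like those in the question.

```r
# Hypothetical factor with 0/1 labels and one missing value
sex <- factor(c("0", "1", "1", "0", NA))

# Applied directly, as.numeric() returns the internal level codes
codes <- as.numeric(sex)
codes                              # 1 2 2 1 NA

# Going through the level labels recovers the original 0/1 values
values <- as.numeric(levels(sex))[sex]
values                             # 0 1 1 0 NA

# Equivalent result via character conversion
values_chr <- as.numeric(as.character(sex))
```

With exactly two levels the slope estimate is the same either way and only the intercept shifts, but if a stray level such as "" is still present the codes become 1, 2, 3 and the coefficient is no longer interpretable as a 0/1 contrast.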