I am working on a dataset where the Gender column has almost 120 NAs. I thought those were actual NAs, but they aren't, and they end up in my model when I don't want them.
I converted Gender to a factor and then checked the levels. This is the output:
levels(data$Gender)
[1] "Female" "Male" NA
Because NA here is not in quotation marks, I assumed it was not a factor level, but it is!
Then I tried to check whether the values actually are NA:
is.na(data$Gender)
And all the values are FALSE! That means R is not reading them as NA.
So I tried converting them with:
data <- data %>% mutate(Gender = ifelse(Gender == "NA", NA, Gender))
What this does is convert my factor variable into a numeric variable, coding the genders as 1, 2, and 3, with 3 for NA.
So, of course, I tried the simplest method:
data[data == "NA"] <- NA
This of course did not work either.
Then I tried:
replace_with_na(data, "Gender", .x~ == "NA")
and this does not work either.
I don't know what I am doing wrong. I don't understand why is.na() returns FALSE even though levels() does not print NA in quotation marks ("") the way it does for Female and Male, nor do I understand why every attempt to convert the values has failed.
CodePudding user response:
First, I think it's better not to convert to factor until you have removed the contaminating values. Otherwise you have to remove those levels later.
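To illustrate the first point, here is a minimal sketch on made-up data (the vector `g` is hypothetical, not the asker's column). It assumes the factor was built so that NA became a level, which would match the `levels()` output in the question. In that situation `is.na()` only inspects the integer codes, `ifelse()` hands back those codes instead of the labels, and rebuilding the factor from its character values is one way to drop the NA level.

```r
# Hypothetical data: a factor in which NA is stored as a level
g <- factor(c("Female", "Male", NA, "Male"), exclude = NULL)

levels(g)   # NA is listed as a level, printed without quotes
is.na(g)    # all FALSE: the underlying integer codes are not missing

# ifelse() drops the factor attributes and returns the integer codes
codes <- ifelse(g == "NA", NA, g)
codes

# Rebuild from the character values; factor() excludes NA by default
g_clean <- factor(as.character(g))

levels(g_clean)  # only the real labels remain
is.na(g_clean)   # TRUE where the value is genuinely missing
```

If the column is cleaned while it is still a character vector, none of this level repair is needed.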
Second, in case you have something other than "NA" in that column, you can simply replace everything that is NOT one of the desired values, so you don't have to enumerate all the possible wrong values.
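# Example vector in which "NA" is an ordinary string, not a missing value
x <- sample(c("M", "F", "NA"), 10, replace = TRUE)
x
#> [1] "M" "M" "M" "NA" "F" "NA" "M" "NA" "NA" "F"
# Set everything that is not a valid value to a real NA
x[!x %in% c("M", "F")] <- NA
x
#> [1] "M" "M" "M" NA "F" NA "M" NA NA "F"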
Created on 2022-04-01 by the reprex package (v2.0.1)
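The same idea applied to a data frame column might look like the sketch below. The data frame `df`, its `Score` column, and the stray value "unknown" are invented for illustration. The factor conversion happens only after the cleaning step.

```r
# Hypothetical data frame; column names and values are made up
df <- data.frame(
  Gender = c("Male", "NA", "Female", "unknown", "Male"),
  Score  = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

valid <- c("Female", "Male")

# Anything that is not a valid label becomes a real missing value
df$Gender[!df$Gender %in% valid] <- NA

# Convert to factor only after cleaning, so no stray levels are created
df$Gender <- factor(df$Gender, levels = valid)

levels(df$Gender)
sum(is.na(df$Gender))
```

Passing `levels = valid` explicitly also fixes the level order, which can matter for the reference category in a model.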
CodePudding user response:
Can you try importing your data again with fread() from data.table with stringsAsFactors = FALSE? If your data is being read as a factor, you can try reading it as a non-factor (just an idea). Since we don't have your data frame, I cannot test my ideas.
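As a rough sketch of the re-import idea, the example below uses base R's `read.csv()` as a stand-in so that it runs without extra packages; `fread()` also accepts `na.strings` and `stringsAsFactors` arguments. The CSV text and the extra "n/a" marker are made up, since the original file is not available.

```r
# Made-up CSV text standing in for the real file
csv_text <- "Gender,Col1,Col2
Male,1,2
NA,3,4
n/a,5,6
Female,8,9"

dat <- read.csv(
  text = csv_text,
  na.strings = c("NA", "n/a", ""),  # strings to treat as missing on import
  stringsAsFactors = FALSE          # keep Gender as character, not factor
)

is.character(dat$Gender)
is.na(dat$Gender)
```

Declaring the missing-value markers at import time means the column never contains the string "NA" in the first place.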
Check the stringr package to replace/remove NAs with its str_* functions.
A simple data frame with NAs can work with filter:
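library(readr)
library(dplyr)
# Small example; read_table() parses the bare NA entries as missing values
dx <- read_table(col_names = TRUE,
'Gender Col1 col2
Male 1 2
NA 3 4
NA 5 6
Female 8 9')
# Rows where Gender is missing
dx %>% filter(is.na(Gender))
# Rows where Gender is present
dx %>% filter(!is.na(Gender))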