For loop converting NA in factor variables into "None"

I want to convert the NAs in my factor variables into a string "None" that will be a level in my data set.

I have tried:

for ( col in 1:ncol(data)){
  class(data$col) == "factor"
  data$col = addNA(data$col)
  levels(data$col) <- c(levels(data$col), "None")
  print(summary(data))
}

And I got this error:

Unknown or uninitialised column: `col`.
Unknown or uninitialised column: `col`.
Error: Assigned data `addNA(cdata$col)` must be compatible with existing data.
x Existing data has 1000 rows.
x Assigned data has 0 rows.
i Only vectors of size 1 are recycled.

What is wrong with this approach? Is there a better way to do this for all factor columns at once, rather than handling each column separately?

CodePudding user response：

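We can apply a function across all the factor columns and convert the NAs to "None" with fct_explicit_na from forcats (deprecated since forcats 1.0.0 in favour of fct_na_value_to_level(), which takes level = "None" instead of na_level):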

library(dplyr)
library(forcats)
data <- data %>%
     mutate(across(where(is.factor), ~ fct_explicit_na(., na_level = "None")))

In the for loop, there are multiple issues:

class(data$col) == "factor" is evaluated but its result is never used; it should be the condition of an if(...) statement (is.factor() is a safer test, because ordered factors have two classes)
data$col is wrong, because $ looks for a column literally named col and there is none; use data[[col]] instead
summary(data) can be printed once, outside the for loop

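for (col in seq_along(data)) {
  # is.factor() also handles ordered factors, where class() has length 2
  if (is.factor(data[[col]])) {
    # addNA() makes NA a level; renaming that level turns the NAs into "None"
    data[[col]] <- addNA(data[[col]])
    levels(data[[col]])[is.na(levels(data[[col]]))] <- "None"
  }
}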

print(summary(data))
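
To see why data$col fails while data[[col]] works, here is a minimal sketch in base R. The data frame dat and its columns g and n are made up for illustration and are not from the question.

```r
# Hypothetical data frame, used only for illustration
dat <- data.frame(
  g = factor(c("a", NA, "b")),
  n = c(1, 2, 3)
)

col <- 1

# `$` takes its right-hand side literally: it looks for a column named "col"
is.null(dat$col)             # TRUE, there is no such column

# `[[` evaluates `col` first, so this returns the first column
is.factor(dat[[col]])        # TRUE

# addNA() on its own only makes NA a level; that level is still NA
levels(addNA(dat[[col]]))    # "a" "b" NA
```

The last line is why the NA level still has to be renamed to "None" after calling addNA().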

CodePudding user response：

Here is an alternative way in base R:

Identify which columns are factors
Add "None" to the levels of each factor
Replace the NAs with "None"

Here is an example with a mock dataset:

# identify which columns are factors
x <- sapply(df, is.factor)

# df[x] keeps a list of columns even when only one column is selected
df[x] <- lapply(df[x], function(f) {
    levels(f) <- c(levels(f), "None")
    replace(f, is.na(f), "None")
})

output:

  a     b         c
  <fct> <fct> <dbl>
1 1     None      2
2 None  3        NA
3 4     None     NA

data:

df <- structure(list(a = structure(c(1L, NA, 2L), .Label = c("1", "4"
), class = "factor"), b = structure(c(NA, 1L, NA), .Label = "3", class = "factor"), 
c = c(2, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))
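
If this is needed in more than one place, the steps could be wrapped in a small function. The sketch below uses only base R; the name na_to_none, its label argument and the toy data frame are hypothetical and not part of either answer.

```r
# Hypothetical helper: make NA an explicit level (default "None")
# in every factor column of a data frame
na_to_none <- function(d, label = "None") {
  is_fct <- vapply(d, is.factor, logical(1))
  d[is_fct] <- lapply(d[is_fct], function(f) {
    f <- addNA(f)                          # NA becomes a level
    levels(f)[is.na(levels(f))] <- label   # rename that level
    f
  })
  d
}

# Small made-up data frame with the same shape as the mock dataset
toy <- data.frame(
  a = factor(c("1", NA, "4")),
  b = factor(c(NA, "3", NA)),
  c = c(2, NA, NA)
)

res <- na_to_none(toy)
levels(res$a)   # "1" "4" "None"
anyNA(res$b)    # FALSE
res$c           # numeric column is left untouched: 2 NA NA
```

Because it relies on addNA(), the "None" level is added to every factor column, including those that contain no NA.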