Replace dataframe entries elementwise conditionally using negated %in%

I would like to replace entries in a dataframe elementwise on the condition that they do not belong to a set of valid possible entries. This is because the other (string) entries in the dataframe are not known ahead of time.

If I attempt to assign NA to the subset df[!(df %in% valid_entries)], all entries in the dataframe are replaced with NA, as opposed to only the elements that satisfy the condition (the kind of behaviour I would expect if I were dealing with a matrix as opposed to a data.frame).

How can I achieve the desired behaviour with my data.frame, ideally without using functions outside base R?

set.seed(123); N <- 100; valid_entries <- c("GOOD", "BAD")
df <- data.frame(A = sample(valid_entries, N, TRUE, c(0.4, 0.6)), 
                 B = sample(valid_entries, N, TRUE, c(0.7, 0.3)))
df[2, 2]  <- "Missing"
df[3, 1] <- "NotAvailable"
head(df)

# %in% does not work -> Replaces all with NA
df[!(df %in% valid_entries)] <- NA
head(df, n = 4)
#    A  B
# 1 NA NA
# 2 NA NA
# 3 NA NA
# 4 NA NA

CodePudding user response：

You might need to apply over the columns:

df[apply(df, 2, \(x) !x %in% valid_entries)] <- NA

output

> head(df)
     A    B
1  BAD GOOD
2 GOOD <NA>
3 <NA> GOOD
4 GOOD  BAD
5 GOOD GOOD
6  BAD  BAD

Note: since R 4.1, \(x) can be used as shorthand for function(x) in anonymous functions.
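For R versions older than 4.1, or to avoid the conversion to a matrix that apply performs, a column-wise variant with lapply may also work. This is only a sketch on a small made-up data frame (`toy` is hypothetical, not the asker's data):

```r
valid_entries <- c("GOOD", "BAD")

# Hypothetical toy data, used only for illustration
toy <- data.frame(A = c("GOOD", "Missing", "BAD"),
                  B = c("BAD", "GOOD", "NotAvailable"))

# toy[] <- ... replaces every column while keeping the data.frame structure
toy[] <- lapply(toy, function(x) {
  x[!(x %in% valid_entries)] <- NA  # blank out anything outside the valid set
  x
})
toy
```

Because each column is handled on its own, this form should leave column types untouched, which may matter if the data frame mixes strings with other types.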

CodePudding user response：

%in% does not work elementwise on a data frame: it treats the data frame as a list of columns and returns one value per column, so you may have to use one of the apply functions.
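To see the difference, it can help to compare the shape of the two results. The snippet below is a small illustration on a made-up data frame (`toy` is hypothetical, not the asker's data):

```r
valid_entries <- c("GOOD", "BAD")

# Hypothetical toy data, used only for illustration
toy <- data.frame(A = c("GOOD", "Missing", "BAD"),
                  B = c("BAD", "GOOD", "NotAvailable"))

# %in% on the whole data frame: one result per column, not per cell
per_column <- toy %in% valid_entries
length(per_column)  # same as ncol(toy)

# %in% applied to each column: one result per cell
per_cell <- sapply(toy, `%in%`, valid_entries)
dim(per_cell)       # same as dim(toy)
```

Negating `per_column` gives a short logical vector that selects whole columns, which is why the assignment in the question overwrites everything.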

Here's an option with sapply -

valid_entries <- c('GOOD', 'BAD')
df[!sapply(df, `%in%`, valid_entries)] <- NA
head(df)

#     A    B
#1  BAD GOOD
#2 GOOD <NA>
#3 <NA> GOOD
#4 GOOD  BAD
#5 GOOD GOOD
#6  BAD  BAD

If valid_entries contains only a few values that you can type out by hand, you may use == (or != in this case) and combine them with &.

df[df != 'GOOD' & df != 'BAD'] <- NA
df
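If valid_entries is too long to type out, the same != comparison could in principle be built programmatically. The following is a sketch using Reduce on a made-up data frame (`toy`, `differs` and `invalid` are hypothetical names):

```r
valid_entries <- c("GOOD", "BAD")

# Hypothetical toy data, used only for illustration
toy <- data.frame(A = c("GOOD", "Missing", "BAD"),
                  B = c("BAD", "GOOD", "NotAvailable"))

# One logical matrix per valid value: TRUE where the cell differs from it
differs <- lapply(valid_entries, function(v) toy != v)

# A cell is invalid only if it differs from every valid value
invalid <- Reduce(`&`, differs)

toy[invalid] <- NA
toy
```

This builds one full-size logical matrix per valid value, so for long lists of valid entries the sapply approach above is probably the lighter option.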

CodePudding user response：

df[df != 'GOOD' & df != 'BAD'] <- NA
df