I would like to replace entries in a data frame elementwise whenever they do not belong to a set of valid entries. I cannot simply list the values to replace, because the other (string) entries in the data frame are not known ahead of time.
If I attempt to assign NA to the subset df[!(df %in% valid_entries)], all entries in the data frame are replaced with NA, as opposed to only the elements that satisfy the condition (the kind of behaviour I would expect if I were dealing with a matrix as opposed to a data.frame).
How can I achieve the desired behaviour with my data.frame, ideally not using functions outside base R?
set.seed(123); N <- 100; valid_entries <- c("GOOD", "BAD")
df <- data.frame(A = sample(valid_entries, N, TRUE, c(0.4, 0.6)),
B = sample(valid_entries, N, TRUE, c(0.7, 0.3)))
df[2, 2] <- "Missing"
df[3, 1] <- "NotAvailable"
head(df)
# %in% does not work -> Replaces all with NA
df[!(df %in% valid_entries)] <- NA
head(df, n = 4)
# A B
# 1 NA NA
# 2 NA NA
# 3 NA NA
# 4 NA NA
CodePudding user response:
You might need to apply over the columns:
df[apply(df, 2, \(x) !x %in% valid_entries)] <- NA
output
> head(df)
A B
1 BAD GOOD
2 GOOD <NA>
3 <NA> GOOD
4 GOOD BAD
5 GOOD GOOD
6 BAD BAD
Note: \ can replace function in lambda-like (anonymous) functions since R 4.1, so \(x) is shorthand for function(x).
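For comparison, here is a minimal sketch of the matrix behaviour the question refers to, assuming a small stand-in for the question's df (the names m and mask are illustrative and not from the original post). On a character matrix, %in% tests every element, so the result only needs its dimensions restored before it can index the data frame:

```r
valid_entries <- c("GOOD", "BAD")
df <- data.frame(A = c("BAD", "GOOD", "NotAvailable", "GOOD"),
                 B = c("GOOD", "Missing", "GOOD", "BAD"))

# as.matrix() turns the data frame into a character matrix,
# on which %in% tests each element individually
m <- as.matrix(df)

# %in% returns a plain vector (no dim attribute), so rebuild
# a logical matrix with the same shape as the data frame
mask <- matrix(!(m %in% valid_entries), nrow = nrow(m))

# A logical matrix of matching shape selects individual cells
df[mask] <- NA
df
```

This should be equivalent to the apply() call above, which also converts the data frame to a matrix internally.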
CodePudding user response:
%in% does not work elementwise on a data frame: it treats the data frame as a list of columns and returns one value per column, so you may have to use one of the apply functions.
Here's an option with sapply:
valid_entries <- c('GOOD', 'BAD')
df[!sapply(df, `%in%`, valid_entries)] <- NA
head(df)
# A B
#1 BAD GOOD
#2 GOOD <NA>
#3 <NA> GOOD
#4 GOOD BAD
#5 GOOD GOOD
#6 BAD BAD
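To see why the original attempt replaces everything, it helps to inspect what %in% returns for a data frame. This is a minimal sketch on a small stand-in data frame (the names small_df, per_column and per_cell are illustrative and not from the original post):

```r
valid_entries <- c("GOOD", "BAD")
small_df <- data.frame(A = c("BAD", "NotAvailable"),
                       B = c("Missing", "GOOD"))

# One logical per column, not per cell: each column is treated
# as a single item, and no column as a whole matches
per_column <- small_df %in% valid_entries
per_column
length(per_column)   # same as ncol(small_df)

# Negating gives TRUE for every column, so df[!(...)] selects
# all columns and the assignment overwrites them entirely

# sapply() applies %in% within each column and returns a
# logical matrix with the same shape as the data frame
per_cell <- sapply(small_df, `%in%`, valid_entries)
dim(per_cell)
```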
If valid_entries contains only a few values that you can enter by hand, you may use == (or != in this case) and combine the comparisons with &.
df[df != 'GOOD' & df != 'BAD'] <- NA
df
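Another base R variant, offered as a sketch rather than as part of the answers above: loop over the columns with lapply() and replace(), so that each column is handled as an ordinary vector and valid_entries can be of any length. The helper name blank_invalid is hypothetical.

```r
valid_entries <- c("GOOD", "BAD")
df <- data.frame(A = c("BAD", "GOOD", "NotAvailable"),
                 B = c("GOOD", "Missing", "GOOD"))

# Hypothetical helper: set every element of a vector that is
# not found in `valid` to NA
blank_invalid <- function(col, valid) {
  replace(col, !(col %in% valid), NA)
}

# Assigning into df[] keeps the data.frame structure while
# replacing the contents of each column
df[] <- lapply(df, blank_invalid, valid = valid_entries)
df
```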
CodePudding user response:
df[df != 'GOOD' & df != 'BAD'] <- NA
df