I'm attempting to write a function that compares the factor columns with the same column names in two data frames. It should identify values in columns c1 and c2 of data frame d2 that do not occur in the corresponding columns of data frame d1, and replace them with NA. For the example below, the correct result would have NA in place of z in d2 column c1 and in place of zz in d2 column c2, but the function fails to return it.
`
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("AA", "BB", "CC", "DD", "EE")
d1 <- data.frame(c1, c2)
c1 <- c("z", "B", "C", "D", "E")
c2 <- c("AA", "zz", "CC", "DD", "EE")
d2 <- data.frame(c1, c2)
v <- colnames(d1)
replace <- NA
x <- d2[v]
repFact = function(x, d1, replace){
x1 <- unique(d1[,v])
y <- x
id <- which(!(y %in% x1))
x[id, v] <- NA
x
return(x)
}
d2[v] <- lapply(d2[v], repFact, d1[v], replace)
`
I'm using this R code to prepare prediction data and am attempting to remove unseen factor levels in d2, replacing them with NA or a seen factor level, so that the prediction function (caret) does not fail.
Any ideas are appreciated; however, I'd like to retain the use of the which and lapply functions if possible.
CodePudding user response:
We may use Map instead of lapply if we want to replace the values in each 'd2' column based on the corresponding 'd1' column. The repFact function is modified as well:
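# x: reference column (from 'd1'); y: column to clean (from 'd2')
repFact <- function(x, y, replaceVal) {
  # replace the values of y that do not occur in x
  replace(y, y %in% setdiff(y, x), replaceVal)
}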
-testing
d2[v] <- Map(repFact, d1[v], d2[v], MoreArgs = list(replaceVal = NA))
> d2
    c1   c2
1 <NA>   AA
2    B <NA>
3    C   CC
4    D   DD
5    E   EE
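Since the question asks to keep which and lapply, here is a minimal sketch of that route. The original repFact receives a single column (a vector) from lapply but then indexes it as if it were a data frame (x[id, v]); one way around this is to iterate over the column names instead, so both data frames can be looked up by name. The helper name repFactByName and its arguments (nm, new, ref) are hypothetical and chosen only for illustration.

```r
# Example data from the question
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("AA", "BB", "CC", "DD", "EE"))
d2 <- data.frame(c1 = c("z", "B", "C", "D", "E"),
                 c2 = c("AA", "zz", "CC", "DD", "EE"))
v <- colnames(d1)

# Hypothetical helper: nm is a column name, new is the data frame to clean,
# ref is the reference data frame holding the values that are allowed.
repFactByName <- function(nm, new, ref, replaceVal = NA) {
  col <- new[[nm]]
  # positions in the 'new' column whose value is not seen in the 'ref' column
  id <- which(!(col %in% unique(ref[[nm]])))
  col[id] <- replaceVal
  col
}

# lapply over the column names; each call returns one cleaned column
d2[v] <- lapply(v, repFactByName, new = d2, ref = d1)
d2
```

With the default replaceVal = NA this should also work when the columns are factors, although the unseen level then stays in levels() of the cleaned column.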
In addition, we can also use tidyverse to do this by mutating across the columns specified in v for 'd2' and then applying repFact with the corresponding 'd1' column as input (cur_column() returns the column name):
library(dplyr)
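d2 <- d2 %>%
  # repFact(x, y, replaceVal) returns the modified y, so the 'd1' column
  # goes first (reference) and .x, the 'd2' column, goes second
  mutate(across(all_of(v), ~ repFact(d1[[cur_column()]], .x, NA)))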
d2
    c1   c2
1 <NA>   AA
2    B <NA>
3    C   CC
4    D   DD
5    E   EE
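If the columns really are factors (the question mentions unseen factor levels), a possible variant is to rebuild each 'd2' column with the levels of the matching 'd1' column: factor() turns any value that is not among the supplied levels into NA, so the unseen values and the unseen levels disappear together. This is only a sketch under the assumption that the data frames were created with stringsAsFactors = TRUE; the helper name alignLevels is hypothetical.

```r
# Same example data, but with factor columns (an assumption for this sketch)
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("AA", "BB", "CC", "DD", "EE"),
                 stringsAsFactors = TRUE)
d2 <- data.frame(c1 = c("z", "B", "C", "D", "E"),
                 c2 = c("AA", "zz", "CC", "DD", "EE"),
                 stringsAsFactors = TRUE)
v <- intersect(colnames(d1), colnames(d2))

# Hypothetical helper: re-encode 'new' using only the levels of 'ref';
# values of 'new' that are not levels of 'ref' become NA.
alignLevels <- function(ref, new) {
  factor(as.character(new), levels = levels(ref))
}

d2[v] <- Map(alignLevels, d1[v], d2[v])
d2
levels(d2$c1)  # now the same levels as d1$c1
```

Compared with replacing values only, this also drops the level z from d2$c1 and zz from d2$c2, which may matter when a model checks the levels of new data against those seen in training.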