In my data below, I want to replace any value in a column (excluding the first column) that occurs fewer than two times (e.g. 'greek' in column L1 and 'german' in column L2) with "others".
I have tried the following, but don't get the desired output. Is there a short and efficient way to do this in R?
data <- data.frame(study=c('a','a','b','c','c','d'),
L1= c('arabic','turkish','greek','arabic','turkish','turkish'),
L2= c(rep('english',5),'german'))
# I tried the following without success:
dd <- data
dd[-1] <- lapply(names(dd)[-1], function(i) ifelse(table(dd[[i]]) < 2,"others",dd[[i]]))
Answer:
forcats has a function specifically for this, fct_lump_min():
dd = data
dd[-1] = lapply(dd[-1], forcats::fct_lump_min, min = 2, other_level = "others")
dd
# study L1 L2
# 1 a arabic english
# 2 a turkish english
# 3 b others english
# 4 c arabic english
# 5 c turkish english
# 6 d turkish others
Your approach fails because ifelse() returns a vector the same length as its test. In your case the test is built from the table, which has one entry per distinct value. You are assigning to the whole column, though, so ifelse() needs to return one entry per row.
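To make the length mismatch concrete, here is a minimal sketch on the question's data. It rebuilds the data frame as dd with stringsAsFactors = FALSE (an assumption, so the columns are plain character vectors on any R version) and compares the lengths involved:

```r
# Example data from the question, rebuilt here so the snippet is self-contained
dd <- data.frame(
  study = c("a", "a", "b", "c", "c", "d"),
  L1 = c("arabic", "turkish", "greek", "arabic", "turkish", "turkish"),
  L2 = c(rep("english", 5), "german"),
  stringsAsFactors = FALSE  # assumption: keep the columns as character
)

tt <- table(dd$L1)                      # one count per distinct value in L1
res <- ifelse(tt < 2, "others", dd$L1)  # the test `tt < 2` has one entry per distinct value

length(tt)   # 3 distinct values
length(res)  # 3, the same length as the test
nrow(dd)     # 6 rows, which is what the column assignment needs
```

The result has one entry per distinct value rather than one per row, so it cannot stand in for the column.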
We can fix it like this:
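dd[-1] <- lapply(names(dd)[-1], function(i) {
  x = as.character(dd[[i]])   # guard: ifelse() would return integer codes for a factor
  tt = table(x)               # counts per distinct value
  drop = names(tt)[tt < 2]    # values that occur fewer than two times
  ifelse(x %in% drop, "others", x)
})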
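If you would rather avoid ifelse() altogether, the same idea can be sketched in base R by looking up each row's count in the table. The helper name lump_rare and its arguments min_n and other are hypothetical, made up for this illustration, and the sketch assumes the columns contain no NA values:

```r
# Example data from the question, rebuilt so the snippet is self-contained
dd <- data.frame(
  study = c("a", "a", "b", "c", "c", "d"),
  L1 = c("arabic", "turkish", "greek", "arabic", "turkish", "turkish"),
  L2 = c(rep("english", 5), "german"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace values seen fewer than `min_n` times with `other`
lump_rare <- function(x, min_n = 2, other = "others") {
  x <- as.character(x)               # work on character values, not factor codes
  counts <- table(x)                 # count per distinct value
  n_per_row <- as.vector(counts[x])  # look up each row's count by name
  x[n_per_row < min_n] <- other      # overwrite only the rare values
  x
}

dd[-1] <- lapply(dd[-1], lump_rare)  # every column except the first
dd
```

Because the lookup counts[x] already has one entry per row, the replacement lines up with the column without any recycling.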