Change the value of a low frequency column to a desired value

In my data below, I want to replace any value in a column (excluding the first column) that occurs fewer than two times (e.g. 'greek' in column L1 and 'german' in column L2) with "others".

I have tried the following, but don't get the desired output. Is there a short and efficient way to do this in R?

data <- data.frame(study=c('a','a','b','c','c','d'),
L1= c('arabic','turkish','greek','arabic','turkish','turkish'),
L2= c(rep('english',5),'german'))

# I tried the following without success:

dd[-1] <- lapply(names(dd)[-1], function(i) ifelse(table(dd[[i]]) < 2,"others",dd[[i]]))

CodePudding user response:

The forcats package has a specific function for this, fct_lump_min() (note that it returns factors, not character vectors):

dd = data
dd[-1] = lapply(dd[-1], forcats::fct_lump_min, min = 2, other_level = "others")
dd
#  study      L1      L2
# 1     a  arabic english
# 2     a turkish english
# 3     b  others english
# 4     c  arabic english
# 5     c turkish english
# 6     d turkish  others

Your approach fails because ifelse() returns a vector the same length as its test. In your code the test is table(dd[[i]]) < 2, which has one element per distinct value, but you are assigning the result to the whole column, so it needs to have one element per row.
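To make the length mismatch concrete, here is a small sketch using the L1 column from the question. The names test and res are only for illustration.

```r
data <- data.frame(study = c('a','a','b','c','c','d'),
                   L1 = c('arabic','turkish','greek','arabic','turkish','turkish'),
                   L2 = c(rep('english', 5), 'german'))

# The test comes from the frequency table, so it has one element
# per distinct value of L1, not one element per row of the data.
test <- table(data$L1) < 2
length(test)   # 3 distinct values
nrow(data)     # 6 rows

# ifelse() follows the shape of its test, so the result is too short
# to be a replacement for the whole column.
res <- ifelse(test, "others", data$L1)
length(res)    # 3
```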

We can fix it like this:

dd[-1] <- lapply(names(dd)[-1], function(i) {
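  tt = table(dd[[i]])        # frequency of each distinct value
  drop = names(tt)[tt < 2]   # values that occur fewer than two times
  # the test now has one element per row, so the result fits the column
  ifelse(dd[[i]] %in% drop, "others", dd[[i]])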
})
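If you need this in more than one place, the same base R idea could be wrapped in a small helper. This is only a sketch: lump_rare and its arguments min and other are hypothetical names, not part of any package, and it assumes you want character columns back.

```r
# Hypothetical helper: replace values that occur fewer than `min` times.
lump_rare <- function(x, min = 2, other = "others") {
  x <- as.character(x)                  # also handles factor input
  counts <- table(x)                    # frequency of each distinct value
  rare <- names(counts)[counts < min]   # values below the threshold
  x[x %in% rare] <- other
  x
}

data <- data.frame(study = c('a','a','b','c','c','d'),
                   L1 = c('arabic','turkish','greek','arabic','turkish','turkish'),
                   L2 = c(rep('english', 5), 'german'))

dd <- data
dd[-1] <- lapply(dd[-1], lump_rare)     # apply to every column but the first
dd
```

Because the function takes the column itself rather than its name, it can be passed straight to lapply(dd[-1], ...), and the threshold can be changed with, for example, lapply(dd[-1], lump_rare, min = 3).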