How to change values that occur fewer than a specific number of times to "other" in a data frame

Time:12-13

I have a data frame with more than 20 distinct values in its "S.A" column. A sample of the column is shown below:

structure(list(`temp$S.A[1:30]` = c("Yaletown", "Fairview VW", 
"West End VW", "Fairview VW", "Downtown VW", "Hastings", "Yaletown", 
"Main", "Marpole", "West End VW", "Yaletown", "Yaletown", "Kitsilano", 
"Hastings East", "Grandview VE", "Grandview Woodland", "Downtown VW", 
"Downtown VW", "West End VW", "Downtown VE", "West End VW", "West End VW", 
"West End VW", "Yaletown", "Downtown VW", "West End VW", "Downtown VW", 
"West End VW", "Yaletown", "West End VW")), row.names = c(NA, 
-30L), class = "data.frame") 

If I use the table function, I get the result below, which shows all possible values of S.A in my data frame:

[Image: output of table(temp$S.A)]

Now, what I want to do is replace names that occur fewer than 100 times with "other". For example, in the table output above, "Arbutus" occurs fewer than 100 times, so I want to change all "Arbutus" values to "other" in order to reduce the number of distinct values. I tried this code to find the names:

    aa <- as.data.frame(table(temp$S.A))
    bb <- subset(aa, aa$Freq < 100)
    cc <- bb[1]

This helps me find the names; however, I am not sure how to continue and replace them.

Answer:

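To continue with what you have, you can use

    temp$S.A[temp$S.A %in% cc$Var1] <- 'Other'

to change every value listed in cc to "Other". Note that cc is a one-column data frame, so the comparison has to be made against its Var1 column (the name that as.data.frame(table(...)) gives it) rather than against cc itself.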
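The same idea can be tried end to end on a small example. This is only a sketch: the data are made up, the threshold is lowered to 3 so the effect is visible, and the names min_count and rare are illustrative. It skips the intermediate data frames and takes the rare names straight from the table as a character vector.

```r
# Made-up sample data; S.A is kept as a character column
temp <- data.frame(
  S.A = c("Yaletown", "Yaletown", "Yaletown",
          "West End VW", "West End VW", "West End VW", "West End VW",
          "Main", "Marpole", "Hastings"),
  stringsAsFactors = FALSE
)

min_count <- 3  # illustrative threshold; the question uses 100

# Count each value, then keep the names whose count is below the threshold
counts <- table(temp$S.A)
rare <- names(counts)[counts < min_count]

# Overwrite every rare value in place
temp$S.A[temp$S.A %in% rare] <- "Other"

table(temp$S.A)
```

This assumes S.A is a character column, as in the posted sample. If it were a factor, assigning a level that does not exist yet would produce NA values, so it would need as.character() first.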


However, forcats has a function for exactly this, fct_lump_min, which lumps every level that appears fewer than the given number of times into "Other".

    temp$S.A <- forcats::fct_lump_min(temp$S.A, 100)
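A short sketch of how the forcats route behaves, assuming the forcats package is installed. The sample vector s and the threshold of 3 are made up for illustration.

```r
# Requires the forcats package: install.packages("forcats") if it is missing.
# Made-up sample; threshold lowered to 3 for illustration.
s <- c(rep("Yaletown", 3), rep("West End VW", 4), "Main", "Marpole", "Hastings")

# Values seen fewer than 3 times are collapsed into "Other"
lumped <- forcats::fct_lump_min(s, min = 3)
table(lumped)

# fct_lump_min() returns a factor; convert back if a character column is needed
lumped_chr <- as.character(lumped)

# The replacement label can be changed, e.g. to a lowercase "other"
lumped_lower <- forcats::fct_lump_min(s, min = 3, other_level = "other")
```

Unlike the base R replacement, the result here is a factor. That is usually convenient for modelling or plotting, but worth keeping in mind if later code expects plain strings.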