Recoding rare categories of a variable to category "other" based on a condition

Time:02-11

I need to transform the variable below based on how often each category occurs in the dataset, so that categories that appear fewer than two times are renamed to the category "other".

Data example

[image: data example, not available]

Desired output

[image: desired output, not available]

I used to use the chunk of code below for this transformation, but since I moved to R 4.0.5 it throws an error.

levels(data$Country_of_origin) <- ifelse(table(data$Country_of_origin) < 2, "OTHER", levels(data$Country_of_origin))

User response:

I'm not sure why your code stopped working, but the forcats::fct_lump_*() family is a good fit for this task. A small example:

library(tidyverse)

d <- c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA') %>% factor()

# original distribution
table(d)
#> d
#> Germany   Japan     USA 
#>       1       1       4

# lumped distribution
fct_lump_min(d, min = 2) %>% table()
#> .
#>   USA Other 
#>     4     2

Created on 2022-02-10 by the reprex package (v2.0.1)
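
A possible reason the original levels() approach stopped working (an assumption, since the error message is not shown): from R 4.0.0 onward, data.frame() and read.csv() default to stringsAsFactors = FALSE, so the column may now be a character vector, for which levels() returns NULL. The sketch below builds a hypothetical data frame with the column name from the question, converts the column to a factor, and lumps rare levels with fct_lump_min(), using its other_level argument to get the "OTHER" label.

```r
library(forcats)

# Hypothetical data frame; the column name follows the question.
data <- data.frame(
  Country_of_origin = c("USA", "USA", "Germany", "Japan", "USA", "USA")
)

# With the R >= 4.0.0 defaults this column is character, so it has no levels.
is.null(levels(data$Country_of_origin))

# Convert to factor, then lump levels seen fewer than 2 times into "OTHER".
data$Country_of_origin <- fct_lump_min(
  factor(data$Country_of_origin),
  min = 2,
  other_level = "OTHER"
)

table(data$Country_of_origin)
```

If the column really is a character vector, wrapping it in factor() before the original levels()<- assignment might also be enough.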

User response:

Here is another option:

Packages

library(dplyr)
library(tibble)
library(magrittr)

Input

data <- tibble( country = c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA'))

data

# A tibble: 6 x 1
  country
  <chr>  
1 USA    
2 USA    
3 Germany
4 Japan  
5 USA    
6 USA  

Solution

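# categories that appear fewer than two times
few_country <- data %>% count(country) %>% filter(n < 2)

# rename those categories to "OTHER" and keep the rest unchanged
data %>%
  mutate(new_country = case_when(country %in% few_country$country ~ "OTHER",
                                 TRUE ~ country))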

Output

# A tibble: 6 x 2
  country new_country
  <chr>   <chr>      
1 USA     USA        
2 USA     USA        
3 Germany OTHER      
4 Japan   OTHER      
5 USA     USA        
6 USA     USA  
      

User response:

Using dplyr, you can calculate frequency after group_by(country) and then mutate country when below a threshold:

library(dplyr)
library(tidyr)

data <- tibble( country = c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA'))

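# note: the native pipe |> requires R >= 4.1.0; on older versions use %>%
data |>
  group_by(country) |>
  mutate(Freq = n()) |>   # number of rows per country
  ungroup() |>
  mutate(country = ifelse(Freq < 2, "OTHER", country))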

# A tibble: 6 x 2
  country  Freq
  <chr>   <int>
1 USA         4
2 USA         4
3 OTHER       1
4 OTHER       1
5 USA         4
6 USA         4
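
For completeness, the same idea can be sketched in base R without any packages. The helper name lump_rare and its arguments are hypothetical, chosen only for this illustration; the function counts each value with table() and overwrites the values whose count falls below the threshold.

```r
# Hypothetical helper, not part of base R or any package.
lump_rare <- function(x, min_count = 2, other = "OTHER") {
  x <- as.character(x)                       # works for factor or character input
  counts <- table(x)                         # frequency of each category
  rare <- names(counts)[counts < min_count]  # categories below the threshold
  x[x %in% rare] <- other
  x
}

country <- c("USA", "USA", "Germany", "Japan", "USA", "USA")
lump_rare(country)
#> [1] "USA"   "USA"   "OTHER" "OTHER" "USA"   "USA"
```

The result is a character vector; wrap it in factor() if a factor is needed downstream.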