I need to transform the variable below based on how often each category occurs in the dataset,
so that categories that appear fewer than two times are renamed to the category "other".
Data example
Desirable output
I used to use the chunk of code below for this transformation, but since I moved to R 4.05 it throws an error.
levels(data$Country_of_origin) <- ifelse(table(data$Country_of_origin) < 2, "OTHER", levels(data$Country_of_origin))
CodePudding user response:
I'm not sure why your code stopped working, but the forcats::fct_lump_*() family is a great option for this application. See the small example here:
library(tidyverse)
d <- c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA') %>% factor()
# original distribution
table(d)
#> d
#> Germany   Japan     USA 
#>       1       1       4 
# lumped distribution
fct_lump_min(d, min = 2) %>% table()
#> .
#>   USA Other 
#>     4     2 
Created on 2022-02-10 by the reprex package (v2.0.1)
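To connect this back to the question, here is a minimal sketch of the same call applied to a data frame column. The data frame below is made up for illustration and only borrows the column name Country_of_origin from the question; other_level is the fct_lump_min() argument that sets the label of the lumped level (it defaults to "Other").

```r
library(forcats)

# Hypothetical data frame; only the column name comes from the question
data <- data.frame(
  Country_of_origin = factor(c("USA", "USA", "Germany", "Japan", "USA", "USA"))
)

# Keep levels that appear at least 2 times; lump the rest into "OTHER"
data$Country_of_origin <- fct_lump_min(
  data$Country_of_origin,
  min = 2,
  other_level = "OTHER"
)

table(data$Country_of_origin)
```

Because min is the smallest count that is preserved, min = 2 corresponds to "appears fewer than two times becomes OTHER".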
CodePudding user response:
Here is another option:
Packages
library(dplyr)
library(tibble)
library(magrittr)
Input
data <- tibble( country = c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA'))
data
# A tibble: 6 x 1
  country
  <chr>
1 USA
2 USA
3 Germany
4 Japan
5 USA
6 USA
Solution
# countries that appear fewer than two times
few_country <- data %>% count(country) %>% filter(n < 2)
data %>%
  mutate(new_country = case_when(country %in% few_country$country ~ "OTHER",
                                 TRUE ~ country))
Output
# A tibble: 6 x 2
  country new_country
  <chr>   <chr>
1 USA     USA
2 USA     USA
3 Germany OTHER
4 Japan   OTHER
5 USA     USA
6 USA     USA
CodePudding user response:
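Using dplyr, you can calculate the frequency of each country after group_by(country) and then overwrite country with "OTHER" when that frequency is below a threshold. Note that the native pipe |> needs R 4.1 or later; on older versions, use %>% instead: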
library(dplyr)
library(tidyr)
data <- tibble( country = c('USA', 'USA', 'Germany', 'Japan', 'USA', 'USA'))
data |>
  group_by(country) |>
  mutate(Freq = n()) |>
  ungroup() |>
  mutate(country = ifelse(Freq < 2, "OTHER", country))
# A tibble: 6 x 2
  country  Freq
  <chr>   <int>
1 USA         4
2 USA         4
3 OTHER       1
4 OTHER       1
5 USA         4
6 USA         4
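If you would rather stay in base R, close to the approach in the question, the following is a rough sketch rather than a drop-in fix. It assumes the column is a factor (the vector x here is invented for illustration) and relies on the fact that assigning the same name to several levels merges them.

```r
# Hypothetical factor standing in for data$Country_of_origin
x <- factor(c("USA", "USA", "Germany", "Japan", "USA", "USA"))

# Count each level and collect the ones seen fewer than 2 times
counts <- table(x)
rare <- names(counts)[counts < 2]

# Rename the rare levels; duplicate level names are merged into one level
levels(x)[levels(x) %in% rare] <- "OTHER"

table(x)
```

Working on names(counts) instead of passing the table itself into ifelse() keeps the replacement a plain character vector, which avoids carrying table attributes into levels<-.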