Recoding factor with many levels

Time:06-17

I need to recode a factor variable with almost 90 levels. The levels are trait names from a database, and I then need to pivot the data to get the dataset for analysis. Is there a way to do this automatically, without typing each OldName = NewName pair?

This is how I do it with dplyr for fewer levels:

df$TraitName <- recode_factor(df$TraitName, 'Old Name' = "new.name")

My idea was to use a key data frame with a column of old names and a column of the corresponding new names, but I cannot figure out how to feed it to recode.

CodePudding user response:

One way would be a lookup table, a join, and coalesce (to get the first non-NA value):

my_data <- data.frame(letters = letters[1:6])

levels_to_change <- data.frame(letters = letters[4:5],
                               new_letters = LETTERS[4:5])

library(dplyr)
my_data %>%
  left_join(levels_to_change) %>%
  mutate(new = coalesce(new_letters, letters))

Result

Joining, by = "letters"
  letters new_letters new
1       a        <NA>   a
2       b        <NA>   b
3       c        <NA>   c
4       d           D   D
5       e           E   E
6       f        <NA>   f
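
To finish the job, the coalesced values can overwrite the original column and be converted back to a factor. This is only a sketch: it assumes both columns are character vectors (the data.frame() default since R 4.0) and that the helper column new_letters is no longer needed afterwards.

```r
library(dplyr)

my_data <- data.frame(letters = letters[1:6])

levels_to_change <- data.frame(letters = letters[4:5],
                               new_letters = LETTERS[4:5])

recoded <- my_data %>%
  # name the key explicitly so the join does not print a message
  left_join(levels_to_change, by = "letters") %>%
  # take the new name where one exists, otherwise keep the old one
  mutate(letters = factor(coalesce(new_letters, letters))) %>%
  # drop the helper column that came from the lookup table
  select(-new_letters)

as.character(recoded$letters)
```

Levels that have no row in the lookup table keep their old names, so the table only needs to list the levels that actually change.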

CodePudding user response:

You can create a named vector from your lookup table and pass it to recode_factor using the splice operator (!!!). It may well be faster than a join.

library(tidyverse)

# test data
df <- tibble(TraitName = c("a", "b", "c"))

# Build a lookup table from your own data by binding your two columns.
# deframe() uses the first column as names and the second as values,
# so keep the column order: old names first, new names second.
# The column names themselves don't matter.
lookup <- tibble(old = c("a", "b", "c"), new = c("aa", "bb", "cc"))

# Convert the lookup table to a named vector and splice it into recode_factor()

df <- 
  df |>
  mutate(TraitNameRecode = recode_factor(TraitName, !!!deframe(lookup)))
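
If you would rather not depend on dplyr, the same lookup idea can be sketched in base R by relabelling the factor's levels directly. This is an alternative to the splice above, not part of it. The column names old and new are assumed, and the lookup table is deliberately incomplete to show that unmatched levels are left alone.

```r
# test data: a factor with a repeated level
df <- data.frame(TraitName = factor(c("a", "b", "c", "a")))

# lookup table that covers only some of the levels (hypothetical names)
lookup <- data.frame(old = c("a", "b"), new = c("aa", "bb"))

old_levels <- levels(df$TraitName)

# position of each existing level in the lookup table, NA if it is absent
idx <- match(old_levels, lookup$old)

# use the new name where a match exists, otherwise keep the old level
new_levels <- ifelse(is.na(idx), old_levels, lookup$new[idx])

# relabel the levels in place; the underlying integer codes do not change
levels(df$TraitName) <- new_levels

as.character(df$TraitName)
```

Because only the levels are touched, the work scales with the number of levels (about 90 here) rather than the number of rows.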