convert factor levels from factor to double in multiple columns

Time:08-31

I have a data frame in which I would like to change multiple columns. The columns are factors with many different levels. The data frame is similar to this:

 myLs <- c("-77", 
                "0",
                "0 Stunden", 
                "30 Minuten",
                "1 Stunde",
                "1 1/2 Stunden",
                "2", 
                "2 1/2",
                "3", 
                "3 1/2",
                "4", 
                "4 1/2",
                "5", 
                "5 1/2",
                "6", 
                "6 1/2",
                "7", 
                "7 1/2",
                "8", 
                "8 1/2",
                "9", 
                "9 1/2",
                "10", 
                "10 1/2",
                "11", 
                "11 1/2",
                "12", 
                "mehr als 12 Stunden")

library(dplyr)  # provides tibble() and %>%

df <- tibble(Var1 = as.factor(sample(myLs, 40, replace = TRUE)),
             Var2 = as.factor(sample(myLs, 40, replace = TRUE)),
             Var4 = 1:40)

I would like to change the labels of the factors, such that they are ultimately numbers for which I can make calculations such as the mean. My current "solution" works, but is ugly, and I have the sense that there should be a better way to do this. Currently, I do:

levels(df$Var1) <- c(NA, 0, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 
                              5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11,
                              11.5, 12, 12.5)

levels(df$Var2) <- c(NA, 0, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 
                              5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11,
                              11.5, 12, 12.5)

df_final <- df %>%
  mutate(Var1 = as.double(as.character(Var1)), 
         Var2 = as.double(as.character(Var2)))

This does the job. However, my real data frame has many more columns, so this means a lot of copy-pasting. Shouldn't there be a more dplyr-ish solution using mutate_at, or something using lapply? Or how about writing a function?

  change_levels <- function(data_frame, column_names) {
    # change the columns in place
  }

Thank you for your help!!

CodePudding user response:

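Note that your factor levels are not in the order you expect: as.factor sorts them as strings, and it only creates levels for values that actually occur in the sample, so some may be missing. E.g.: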

> levels(df$Var1)
 [1] "-77"        "0"          "0 Stunden"  "1 Stunde"   "10"         "10 1/2"     "11 1/2"     "2 1/2"      "3"          "3 1/2"     
[11] "30 Minuten" "4 1/2"      "5"          "5 1/2"      "6 1/2"      "7"          "7 1/2"      "8"          "8 1/2"      "9"         
[21] "9 1/2"  

This means that your initial approach could lead to unexpected behavior, because levels(x) <- ... assigns the new values by position. You can avoid that by defining the factor with all its levels, using factor rather than as.factor:

df <- tibble(Var1 = factor(sample(myLs, 40, replace = T), levels = myLs),
             Var2 = factor(sample(myLs, 40, replace = T), levels = myLs),
             Var4 = 1:40)
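
A minimal illustration of the difference, using a made-up two-element vector `x` (not part of the question's data): `as.factor` sorts the levels as strings and only keeps values that occur, whereas `factor` with an explicit `levels` argument preserves the order you supply.

```r
# Made-up example vector, for illustration only
x <- c("2", "10")

# as.factor() sorts levels as character strings, so "10" comes before "2"
sorted_levels <- levels(as.factor(x))

# factor() with explicit levels keeps the given order and also keeps
# levels that never occur in the data ("12" here)
wanted <- c("2", "10", "12")
fixed_levels <- levels(factor(x, levels = wanted))

print(sorted_levels)
print(fixed_levels)
```

Assigning `levels(x) <- ...` by position is only safe in the second case, where the level order is known in advance.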

A safer approach is to define a key that maps each label to its number. Then it no longer matters how the factor was defined (and thus how its levels are ordered), and you can use mutate, across and recode_factor/recode:

key <- setNames(c(NA, 0, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 
                  5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11,
                  11.5, 12, 12.5),
                myLs
                )

Similar to your approach:

library(dplyr)

df |>
  mutate(across(where(is.factor),
                ~ as.numeric(as.character(recode_factor(., !!!key)))))

However, why not skip the factor round trip and just use recode, which returns numbers directly:

library(dplyr)

df |>
  mutate(across(where(is.factor),
                ~ recode(., !!!key)))

Output:

# A tibble: 40 × 3
    Var1  Var2  Var4
   <dbl> <dbl> <int>
 1   6     6       1
 2   8.5  12.5     2
 3  NA    11       3
 4  12.5   7.5     4
 5  12.5   0       5
 6   8    NA       6
 7   3     8.5     7
 8   8.5   9       8
 9   8     5.5     9
10   0    NA      10
# … with 30 more rows
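
Since the question also asks about lapply, here is a hedged base R sketch of the same idea: look each label up in a named vector. The names `key_small`, `d` and `cols` are made up for this example, and the key is shortened to a few labels to keep it self-contained.

```r
# Shortened, illustrative key: label -> number (NA for the missing code)
key_small <- c("-77" = NA, "0 Stunden" = 0, "30 Minuten" = 0.5,
               "1 Stunde" = 1, "2 1/2" = 2.5)

# Small made-up data frame with two factor columns
d <- data.frame(
  Var1 = factor(c("1 Stunde", "-77", "2 1/2")),
  Var2 = factor(c("30 Minuten", "0 Stunden", "1 Stunde")),
  Var4 = 1:3
)

cols <- c("Var1", "Var2")

# Index the key by label; unname() drops the labels from the result
d[cols] <- lapply(d[cols], function(x) unname(key_small[as.character(x)]))

mean(d$Var1, na.rm = TRUE)
```

Because the lookup goes by label and not by position, the order of the factor levels does not matter here either.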

Instead of across(where(is.factor), ...) you could also select the columns explicitly, e.g. across(Var1:Var2, ...). across accepts any tidyselect selection.
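
If you prefer the function the question asks for, one possible sketch follows. `change_levels` and its extra `key` argument are hypothetical, and the body uses a plain named-vector lookup in place of recode so that the helper does not rely on `!!!` splicing. The small `key_demo` and `df_demo` objects are made up for the example.

```r
library(dplyr)

# Hypothetical helper: convert the named factor columns to numbers.
# `key` is a named numeric vector mapping each label to its value.
change_levels <- function(data_frame, column_names, key) {
  data_frame |>
    mutate(across(all_of(column_names),
                  ~ unname(key[as.character(.x)])))
}

# Made-up demo data with a shortened key
key_demo <- c("-77" = NA, "30 Minuten" = 0.5, "1 Stunde" = 1, "3" = 3)

df_demo <- tibble(
  Var1 = factor(c("3", "-77", "1 Stunde")),
  Var2 = factor(c("30 Minuten", "3", "3")),
  Var4 = 1:3
)

df_num <- change_levels(df_demo, c("Var1", "Var2"), key_demo)
df_num
```

The function returns a modified copy and leaves its input unchanged, which is the usual convention in R. Columns not listed in `column_names` (here `Var4`) pass through untouched.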

CodePudding user response:

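The worst way to do this is to rewrite the labels themselves with nested sub calls (note the " Stunden?" pattern, so that "1 Stunde" is handled as well):

myLS <- sub("-", NA, sub(" 1/2", ".5", sub("30 Minuten", "0.5", sub(" Stunden?", "", sub("mehr als 12", "12.5", myLs)))))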

and then

df <- tibble(Var1 =as.double(sample(myLS, 40, replace = T)),
             Var2 =as.double(sample(myLS, 40, replace = T)),
             Var4 = 1:40)

Please don't mention my name if you do this, you never heard this from me...
