I have a data frame in which I would like to change multiple columns. They are factors with many different levels. The data frame is similar to this:
myLs <- c("-77",
"0",
"0 Stunden",
"30 Minuten",
"1 Stunde",
"1 1/2 Stunden",
"2",
"2 1/2",
"3",
"3 1/2",
"4",
"4 1/2",
"5",
"5 1/2",
"6",
"6 1/2",
"7",
"7 1/2",
"8",
"8 1/2",
"9",
"9 1/2",
"10",
"10 1/2",
"11",
"11 1/2",
"12",
"mehr als 12 Stunden")
library(dplyr)  # provides tibble() and %>%
df <- tibble(Var1 = as.factor(sample(myLs, 40, replace = TRUE)),
             Var2 = as.factor(sample(myLs, 40, replace = TRUE)),
             Var4 = 1:40)
I would like to change the labels of the factors, such that they are ultimately numbers for which I can make calculations such as the mean. My current "solution" works, but is ugly, and I have the sense that there should be a better way to do this. Currently, I do:
levels(df$Var1) <- c(NA, 0, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5,
5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11,
11.5, 12, 12.5)
levels(df$Var2) <- c(NA, 0, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5,
5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11,
11.5, 12, 12.5)
df_final <- df %>%
mutate(Var1 = as.double(as.character(Var1)),
Var2 = as.double(as.character(Var2)))
This does the job. However, in my real data frame I have many more columns, and this takes a lot of copy-pasting. Shouldn't there be a more dplyr-ish solution using mutate_at? Or something using lapply? Or how about writing a function?
change_levels <- function(data_frame, column_names) {
  # change the columns in place
}
Thank you for your help!!
CodePudding user response:
Note that your factor levels are not in the order you expect, and they might not include all of the values in myLs. E.g.:
> levels(df$Var1)
[1] "-77" "0" "0 Stunden" "1 Stunde" "10" "10 1/2" "11 1/2" "2 1/2" "3" "3 1/2"
[11] "30 Minuten" "4 1/2" "5" "5 1/2" "6 1/2" "7" "7 1/2" "8" "8 1/2" "9"
[21] "9 1/2"
This means that your initial approach could lead to unexpected behavior unless you define the factor with all its levels, using factor() rather than as.factor():
df <- tibble(Var1 = factor(sample(myLs, 40, replace = T), levels = myLs),
Var2 = factor(sample(myLs, 40, replace = T), levels = myLs),
Var4 = 1:40)
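To make the ordering problem concrete, here is a minimal base R sketch. The three-value vector x is a made-up example, not part of the original data; it only illustrates how as.factor() sorts labels alphabetically and how assigning levels by position can then mislabel values.

```r
# Hypothetical three-value example (not from the original data frame).
x <- c("2", "3", "10")

# as.factor() sorts the observed labels as strings, so "10" comes first.
levels(as.factor(x))
# [1] "10" "2"  "3"

# Assigning new levels by position therefore pairs them with the wrong labels.
f <- as.factor(x)
levels(f) <- c(2, 3, 10)
as.character(f)
# [1] "3"  "10" "2"

# factor() with explicit levels keeps the order you give it,
# including levels that do not occur in the data ("12" here).
g <- factor(x, levels = c("2", "3", "10", "12"))
levels(g)
# [1] "2"  "3"  "10" "12"
```

With explicit levels, a positional assignment such as levels(g) <- c(2, 3, 10, 12) lines up as intended.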
A safer approach is to define a key. Then, no matter how the factor is defined (and thus whatever the order of its levels), you can use mutate, across, and recode_factor/recode:
key <- setNames(c(NA, 0, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5,
5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11,
11.5, 12, 12.5),
myLs
)
Similar to your approach:
library(dplyr)
df |>
mutate(across(where(is.factor),
~ as.numeric(as.character(recode_factor(., !!!key)))))
However, why not skip the intermediate factor conversion and just use recode:
library(dplyr)
df |>
mutate(across(where(is.factor),
~ recode(., !!!key)))
Output:
# A tibble: 40 × 3
Var1 Var2 Var4
<dbl> <dbl> <int>
1 6 6 1
2 8.5 12.5 2
3 NA 11 3
4 12.5 7.5 4
5 12.5 0 5
6 8 NA 6
7 3 8.5 7
8 8.5 9 8
9 8 5.5 9
10 0 NA 10
# … with 30 more rows
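Because "-77" is mapped to NA, calculations such as the mean need to skip missing values. A small sketch with a made-up vector (hours is an illustrative name, not a column of the original data):

```r
# Made-up recoded values; NA stands in for the original "-77" label.
hours <- c(6, 8.5, NA, 12.5)

# Without na.rm, a single NA makes the result NA.
mean(hours)
# [1] NA

# na.rm = TRUE drops the missing value before averaging.
mean(hours, na.rm = TRUE)
# [1] 9
```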
An alternative to across(where(is.factor), ...) is across(Var1:Var2, ...), since across accepts tidyselect helpers.
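If you would rather have the function from the question, the same key idea can be wrapped with lapply in base R. This is only a sketch: the extra key argument, small_key, and small_df are assumptions made for illustration, and the function returns a modified copy rather than changing the data frame in place.

```r
# Hypothetical helper: recode the named columns through a lookup vector.
change_levels <- function(data_frame, column_names, key) {
  data_frame[column_names] <- lapply(
    data_frame[column_names],
    # Index the key by label, so the order of the factor levels is irrelevant.
    function(column) unname(key[as.character(column)])
  )
  data_frame
}

# Small made-up example with a shortened key.
small_key <- c("-77" = NA, "30 Minuten" = 0.5, "1 Stunde" = 1, "2" = 2)
small_df <- data.frame(
  Var1 = factor(c("2", "-77", "30 Minuten")),
  Var2 = factor(c("1 Stunde", "2", "2")),
  Var4 = 1:3
)

result <- change_levels(small_df, c("Var1", "Var2"), small_key)
result$Var1
# [1] 2.0  NA 0.5
```

Labels that are missing from the key come back as NA, so it is worth checking the key against the levels of each column before relying on the result.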
CodePudding user response:
The worst way to do this is
# myLs (lowercase s) is the vector from the question; R is case-sensitive
myLS <- sub("-", NA, sub(" 1/2", ".5", sub("30 Minuten", "0.5", sub(" Stunden?", "", sub("mehr als 12", "12.5", myLs)))))
and then
df <- tibble(Var1 =as.double(sample(myLS, 40, replace = T)),
Var2 =as.double(sample(myLS, 40, replace = T)),
Var4 = 1:40)
Please don't mention my name if you do this, you never heard this from me...