I want to calculate the empirical quantile for each value in multiple columns of my data.frame.
df <- structure(list(id = 1:20, number_1 = c(32, 220, -176, 113, 4,
317, -55, 41, -97, 43, -8, -129, 125, 30, 44, 88, -42, 7, 73,
-5), number_2 = c(15, -105, 77, 41, -183, 120, -216, -46, 40,
-9, -46, 318, -92, 223, -66, 88, -45, -48, -1, -208), number_3 = c(-16,
223, 34, 46, -62, 18, -158, 68, 124, 130, 54, 89, 89, -93, 46,
-126, 21, -20, 118, 20)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
# head(df)
# A tibble: 6 x 4
id number_1 number_2 number_3
<int> <dbl> <dbl> <dbl>
1 1 32 15 -16
2 2 220 -105 223
3 3 -176 77 34
4 4 113 41 46
5 5 4 -183 -62
6 6 317 120 18
I found this on SO, but the function ecdf()
seems not to be working for calculations of columns within mutate()
.
I got it working by using this approach:
# generate a temporary df
quantiles_temp <- map(df, ~quantile(.x, probs = seq(0, 1, 0.2)))
# Use quantiles_temp list to check each value in df with a case_when
df_quantiles <- df %>%
mutate(across(number_1:number_3,
~case_when(
.x <= quantiles_temp[[cur_column()]]["20%"] ~ "0% - 20%",
.x >= quantiles_temp[[cur_column()]]["20%"] & .x <= quantiles_temp[[cur_column()]]["40%"] ~ "20% - 40%",
.x >= quantiles_temp[[cur_column()]]["40%"] & .x <= quantiles_temp[[cur_column()]]["60%"] ~ "40% - 60%",
.x >= quantiles_temp[[cur_column()]]["60%"] & .x <= quantiles_temp[[cur_column()]]["80%"] ~ "60% - 80%",
.x >= quantiles_temp[[cur_column()]]["80%"] ~ "80% - 100%"))
)
> head(df_quantiles)
# A tibble: 6 x 4
id number_1 number_2 number_3
<int> <chr> <chr> <chr>
1 1 40% - 60% 60% - 80% 20% - 40%
2 2 80% - 100% 0% - 20% 80% - 100%
3 3 0% - 20% 60% - 80% 40% - 60%
4 4 80% - 100% 60% - 80% 40% - 60%
5 5 20% - 40% 0% - 20% 0% - 20%
6 6 80% - 100% 80% - 100% 20% - 40%
Is there a better way?
My approach requires me to generate a temporary list quantiles_temp
to look up the empirical quantile for each value in each column.
As I am trying to optimize my code in terms of readability and performance, I was wondering if there is a better way to calculate the quantiles for each column?
CodePudding user response:
Try this:
library(dplyr)
df %>%
mutate(
across(number_1:number_3,
~ cut(., quantile(., seq(0, 1, by = 0.2)),
labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
include.lowest = TRUE))
)
# # A tibble: 20 x 4
# id number_1 number_2 number_3
# <int> <fct> <fct> <fct>
# 1 1 40-60% 60-80% 20-40%
# 2 2 80-100% 0-20% 80-100%
# 3 3 0-20% 60-80% 40-60%
# 4 4 80-100% 60-80% 40-60%
# 5 5 20-40% 0-20% 0-20%
# 6 6 80-100% 80-100% 20-40%
# 7 7 0-20% 0-20% 0-20%
# 8 8 40-60% 20-40% 60-80%
# 9 9 0-20% 60-80% 80-100%
# 10 10 60-80% 40-60% 80-100%
# 11 11 20-40% 20-40% 60-80%
# 12 12 0-20% 80-100% 60-80%
# 13 13 80-100% 20-40% 60-80%
# 14 14 40-60% 80-100% 0-20%
# 15 15 60-80% 20-40% 40-60%
# 16 16 60-80% 80-100% 0-20%
# 17 17 20-40% 40-60% 40-60%
# 18 18 40-60% 20-40% 20-40%
# 19 19 60-80% 40-60% 80-100%
# 20 20 20-40% 0-20% 20-40%
This produces factor
s; if you choose, wrap the cut(.)
in as.character
if you prefer strings (though I suggest that the ordinal nature of the factors provides a sound basis for things like sorting/ranking).