I am working with the R programming language. Suppose I have the following data frame:
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,10,10)
var_3 = rnorm(100,10,10)
d = data.frame(var_1, var_2, var_3)
head(d)
      var_1     var_2      var_3
1 14.251923 14.877801  22.636207
2  7.325137  8.513718  21.021522
3  3.400001 -3.400397  11.274797
4 16.400597  8.623980   9.366115
5  7.065583 13.155570  17.891432
6 21.297912  4.341385 -11.337330
My Question: For each element in each variable, I want to replace the element with the percentile it belongs to.
For example:
a = quantile(d$var_1, c(0.15, 0.3, 0.35, 0.45, 0.5, 0.65, 0.7, 0.8, 0.85, 0.9, 0.95, 1))
b = quantile(d$var_2, c(0.16, 0.23, 0.65, 0.71, 0.95))
c = quantile(d$var_3, c(0.15, 0.28, 0.7, 0.73, 0.87))
> a
        5%        10%        15%        20%        25%        30%        35%        40%        45%        50%        55%        60%        65%        70%        75%
-0.8806901  0.3595086  1.1201300  3.0581928  5.0901641  7.0056228  7.6089831  8.9853805  9.9264540 10.2235212 11.5707533 13.2422940 15.1076889 16.5354881 17.9336020
       80%        85%        90%        95%       100%
19.5312682 21.9264905 24.4511364 26.6820271 41.4419744
> b
      16%       23%       65%       71%       95%
-2.795294  1.430715 11.070815 12.688064 25.270823
> c
      15%       28%       70%       73%       87%
 0.958404  5.767591 15.258532 16.013648 20.467892
For example:
- if d$var_2 < -2.795294, then d$var_2 = 16th percentile
- if d$var_3 is between 5.767591 and 15.258532, then d$var_3 = 70th percentile
I can write multiple "if statements" to do this manually, but is there a faster way to do this?
Thanks!
Answer:
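You could do something like this by applying a custom function built on cut(), with the quantiles as break points. Each probability vector starts at 0 and ends at 1 so that every value falls into a bin, and each bin is labelled with its upper quantile: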
library(tidyverse)

# Bin x at the quantiles given by y and label each bin with its upper quantile
ApplyQuantiles <- function(x, y) {
  q <- quantile(x, probs = y)
  cut(
    x,
    breaks = q,
    labels = names(q)[-1],   # drop the label of the lowest break ("0%")
    include.lowest = TRUE    # keep the minimum value in the first bin
  )
}

output <- d %>%
  mutate(
    var_1 = ApplyQuantiles(var_1, c(0, 0.15, 0.3, 0.35, 0.45, 0.5, 0.65, 0.7, 0.8, 0.85, 0.9, 0.95, 1)),
    var_2 = ApplyQuantiles(var_2, c(0, 0.16, 0.23, 0.65, 0.71, 0.95, 1.0)),
    var_3 = ApplyQuantiles(var_3, c(0, 0.15, 0.28, 0.7, 0.73, 0.87, 1.0))
  ) %>%
  # turn labels such as "15%" into "15th percentile"
  mutate(across(everything(), ~ str_replace(as.character(.x), "%", "th percentile")))
Output
head(output, 10)
             var_1            var_2           var_3
1  45th percentile  95th percentile 87th percentile
2  35th percentile 100th percentile 70th percentile
3  70th percentile  95th percentile 70th percentile
4  80th percentile  65th percentile 70th percentile
5  30th percentile  16th percentile 28th percentile
6  15th percentile  95th percentile 28th percentile
7  30th percentile  16th percentile 15th percentile
8  45th percentile  16th percentile 70th percentile
9  65th percentile  95th percentile 70th percentile
10 45th percentile  65th percentile 70th percentile
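
If you would rather not depend on the tidyverse, the same idea can probably be sketched in base R. This is only an illustration under a few assumptions: `label_quantiles` and the `probs` list are hypothetical names introduced here, the seed and the probability vectors are arbitrary, and the quantile break points are assumed to be distinct (which `cut()` requires).

```r
# Hypothetical helper (not part of the answer above): label each value with
# the upper quantile of the bin it falls into, using base R only.
label_quantiles <- function(x, probs) {
  probs <- sort(unique(c(0, probs, 1)))   # make sure the bins span the full range
  q <- quantile(x, probs = probs)
  cut(
    x,
    breaks = q,
    # "16%" -> "16th percentile"; the lowest break ("0%") has no bin of its own
    labels = sub("%", "th percentile", names(q)[-1], fixed = TRUE),
    include.lowest = TRUE
  )
}

set.seed(1)  # arbitrary seed, only so the example is reproducible
d <- data.frame(
  var_1 = rnorm(100, 10, 10),
  var_2 = rnorm(100, 10, 10)
)

# One probability vector per column, matched by column name (illustrative values)
probs <- list(
  var_1 = c(0.15, 0.3, 0.5, 0.8),
  var_2 = c(0.16, 0.23, 0.65, 0.71, 0.95)
)

out <- d
out[names(probs)] <- Map(label_quantiles, d[names(probs)], probs)
head(out)
```

Keeping the probabilities in a named list means adding another column is a one-line change, and `Map()` pairs each column with its own break points. The result columns are factors; wrap the `cut()` call in `as.character()` if plain strings are preferred.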