R: Interchanging "Quantile" and "Ntile" Functions?

Time:01-06

I am working with the R programming language.

I have the following dataset:

set.seed(123)
library(dplyr)

var1 = rnorm(10000, 100,100)
var2 = rnorm(10000, 100,100)
var3 = rnorm(10000, 100,100)
var4 = rnorm(10000, 100,100)
var5 <- factor(sample(c("A","B", "C", "D", "E"), 1000, replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2)))

my_data = data.frame( var1, var2, var3, var4, var5)

The following code, which uses ntile(), runs without problems:

test = my_data %>%
  group_by(var5) %>%
  mutate(group = ntile(var1, 4)) %>%
  group_by(var5, group) %>%
  mutate(min = min(var1),
         max = max(var1)) %>%
  mutate(range = paste(min, max, sep = "-")) %>%
  ungroup()

I then tried to replace the ntile() call with quantile():

test = my_data %>%
  group_by(var5) %>%
  mutate(group = quantile(var1, c(0, 0.25, 0.5, 0.75, 1))) %>%
  group_by(var5, group) %>%
  mutate(min = min(var1),
         max = max(var1)) %>%
  mutate(range = paste(min, max, sep = "-")) %>%
  ungroup()

But I get the following error:

Error in `mutate()`:
! Problem while computing `group = quantile(var1, c(0, 0.25, 0.5, 0.75, 1))`.
x `group` must be size 2170 or 1, not 5.
i The error occurred in group 1: var5 = A.
Run `rlang::last_error()` to see where the error occurred.

Can someone please show me how to fix this?

Thanks!

CodePudding user response:

quantile() returns a vector as long as its probs argument (5 values here), whatever the length of the input

> quantile(rnorm(25), probs = c(0, 0.25, 0.5, 0.75, 1))
        0%        25%        50%        75%       100% 
-2.2104715 -1.3785488 -0.3379010  0.5721671  2.0572593 

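whereas mutate() needs a result with one value per row of the group (or a single value). For var5 = A the group has 2170 rows and quantile() returns 5 values, hence the error. One option is cut(), which assigns every row to an interval, using the group quantiles as breaks.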
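
The size mismatch can be reproduced on a toy vector. This is a minimal sketch, assuming only base R plus dplyr; the vector x is made up for illustration and is not part of the question's data.

```r
library(dplyr)

# Hypothetical toy vector, not part of the original data
x <- c(5, 1, 9, 3, 7, 2, 8, 4)

# quantile() summarises: one value per requested probability
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
length(q)              # 5, regardless of length(x)

# ntile() labels: one bucket number per element of x
length(ntile(x, 4))    # 8

# cut() also returns one label per element, so it fits inside mutate()
b <- cut(x, breaks = c(-Inf, q))
length(b)              # 8
```

The code that follows applies the same cut() call per level of var5 on my_data.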

test2 <- my_data %>%
  group_by(var5) %>%
  mutate(group = cut(var1, breaks = c(-Inf, 
        quantile(var1, c(0, 0.25, 0.5, 0.75, 1))))) %>%
  group_by(var5, group) %>%
  mutate(min = min(var1),
         max = max(var1)) %>%
  mutate(range = paste(min, max, sep = "-")) %>%
  ungroup()

-output

> test2
# A tibble: 10,000 × 9
    var1  var2    var3  var4 var5  group          min   max range                             
   <dbl> <dbl>   <dbl> <dbl> <fct> <fct>        <dbl> <dbl> <chr>                             
 1  44.0 337.    16.4   80.6 E     (35.5,99.5]   35.5  99.4 35.5075826967309-99.4142501887206 
 2  77.0  83.3   77.9  126.  E     (35.5,99.5]   35.5  99.4 35.5075826967309-99.4142501887206 
 3 256.  193.  -110.    46.2 E     (168,472]    168.  472.  168.347362097985-471.572072587951 
 4 107.   43.2  -66.8  -17.9 A     (96.7,166]    96.7 166.  96.7121945194114-165.961545193295 
 5 113.  123.    -9.80 190.  C     (99.2,166]    99.2 166.  99.2497216290279-166.111860991813 
 6 272.  213.   -66.6   98.4 E     (168,472]    168.  472.  168.347362097985-471.572072587951 
 7 146.  238.    95.0  118.  D     (102,170]    102.  170.  102.486419143665-170.378758447782 
 8 -26.5  76.7  256.   160.  D     (-229,33.4] -219.   33.3 -218.918610643503-33.3345619877818
 9  31.3 -60.1   59.5  126.  A     (-285,36.6] -247.   36.5 -246.749014053051-36.5196181691353
10  55.4  70.2  179.   130.  B     (30.5,97]     30.5  97.0 30.5063174765106-96.9569718037872 
# … with 9,990 more rows
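
One detail in the output above: because the breaks start with -Inf followed by the 0% quantile, each group's minimum falls into a bin of its own, (-Inf, min]. That is why row 8 shows the bin (-229,33.4] for var5 = D while its min column is -219. If exactly four bins per group are wanted, as with ntile(var1, 4), one possible variant drops -Inf and passes include.lowest = TRUE to cut(). The following is a sketch under that assumption; quartile_bin is a hypothetical helper name, and the data setup repeats the question's code so the block runs on its own.

```r
library(dplyr)

# Data setup copied from the question so the block is self-contained
set.seed(123)
var1 <- rnorm(10000, 100, 100)
var2 <- rnorm(10000, 100, 100)
var3 <- rnorm(10000, 100, 100)
var4 <- rnorm(10000, 100, 100)
var5 <- factor(sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE,
                      prob = c(0.2, 0.2, 0.2, 0.2, 0.2)))
my_data <- data.frame(var1, var2, var3, var4, var5)

# Hypothetical helper: quartile bins, with the minimum kept in the first bin
quartile_bin <- function(x) {
  cut(x, breaks = quantile(x, c(0, 0.25, 0.5, 0.75, 1)), include.lowest = TRUE)
}

test3 <- my_data %>%
  group_by(var5) %>%
  mutate(group = quartile_bin(var1)) %>%
  group_by(var5, group) %>%
  mutate(min = min(var1),
         max = max(var1)) %>%
  mutate(range = paste(min, max, sep = "-")) %>%
  ungroup()

# Number of distinct bins per level of var5
bins_per_level <- test3 %>% distinct(var5, group) %>% count(var5)
bins_per_level
```

Note that cut() assigns rows by value while ntile() assigns them by rank, so the two approaches can place rows near a quartile boundary in different groups.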