Excluding outliers in ntile()-CodePudding

I am trying to assign quantile groups for a stacked data such that for each category (r1 and r2 in my example) of data, I can classify the values into 5 groups. I can manage to do this using ntile() as follows.

r1<-rnorm(10,0,1)
r2<-rnorm(10,2,4)
df<-cbind(r1,r2)
df<-melt(df)

df<-df%>%group_by(Var2) %>% mutate(group=ntile(value,5))

However, what should I do if I hope to exclude the top and bottom 10% when sorting the groups. Ideally, I hope to keep those top and bottom values in the output table with their group code showing as "NA".

Thanks to anyone who can help!

CodePudding user response：

Your question is a little ambiguous. It is not clear whether you wish to exclude the top and bottom 10% from the quintile calculation (so that you are getting equal quintiles of the 10-90th centiles of the original data), or whether you want to do the quintiles first on all the data, then exclude the first and last 10%. Doing it the second way will give you smaller 1st and 5th quintiles, so I assume you mean the first method:

df %>% 
  group_by(Var2) %>% 
  mutate(group = ntile(value, 10)) %>%
  mutate(group = ntile(ifelse(group %% 9 == 1, NA, value), 5))
#> # A tibble: 20 x 4
#> # Groups:   Var2 [2]
#>     Var1 Var2   value group
#>    <int> <fct>  <dbl> <int>
#>  1     1 r1    -0.626     1
#>  2     2 r1     0.184     2
#>  3     3 r1    -0.836    NA
#>  4     4 r1     1.60     NA
#>  5     5 r1     0.330     3
#>  6     6 r1    -0.820     1
#>  7     7 r1     0.487     3
#>  8     8 r1     0.738     5
#>  9     9 r1     0.576     4
#> 10    10 r1    -0.305     2
#> 11     1 r2     8.05     NA
#> 12     2 r2     3.56      2
#> 13     3 r2    -0.485     1
#> 14     4 r2    -6.86     NA
#> 15     5 r2     6.50      5
#> 16     6 r2     1.82      1
#> 17     7 r2     1.94      2
#> 18     8 r2     5.78      4
#> 19     9 r2     5.28      3
#> 20    10 r2     4.38      3

Just in case, the second method you would achieve like this:

df %>% 
  group_by(Var2) %>% 
  mutate(group = ntile(value, 5)) %>%
  mutate(group = ifelse(ntile(value, 10) %% 9 == 1, NA, group))
#> # A tibble: 20 x 4
#> # Groups:   Var2 [2]
#>     Var1 Var2   value group
#>    <int> <fct>  <dbl> <int>
#>  1     1 r1    -0.626     2
#>  2     2 r1     0.184     3
#>  3     3 r1    -0.836    NA
#>  4     4 r1     1.60     NA
#>  5     5 r1     0.330     3
#>  6     6 r1    -0.820     1
#>  7     7 r1     0.487     4
#>  8     8 r1     0.738     5
#>  9     9 r1     0.576     4
#> 10    10 r1    -0.305     2
#> 11     1 r2     8.05     NA
#> 12     2 r2     3.56      3
#> 13     3 r2    -0.485     1
#> 14     4 r2    -6.86     NA
#> 15     5 r2     6.50      5
#> 16     6 r2     1.82      2
#> 17     7 r2     1.94      2
#> 18     8 r2     5.78      4
#> 19     9 r2     5.28      4
#> 20    10 r2     4.38      3

^{Created on 2022-02-19 by the reprex package (v2.0.1)}

Setup and data used

library(dplyr)
library(reshape2)

set.seed(1)

r1 <- rnorm(10,0,1)
r2 <- rnorm(10,2,4)
df <- cbind(r1,r2)
df <- melt(df)