Home > OS >  Excluding outliers in ntile()
Excluding outliers in ntile()

Time:02-20

I am trying to assign quantile groups for a stacked data such that for each category (r1 and r2 in my example) of data, I can classify the values into 5 groups. I can manage to do this using ntile() as follows.

r1<-rnorm(10,0,1)
r2<-rnorm(10,2,4)
df<-cbind(r1,r2)
df<-melt(df)

df<-df%>%group_by(Var2) %>% mutate(group=ntile(value,5)) 

However, what should I do if I hope to exclude the top and bottom 10% when sorting the groups. Ideally, I hope to keep those top and bottom values in the output table with their group code showing as "NA".

Thanks to anyone who can help!

CodePudding user response:

Your question is a little ambiguous. It is not clear whether you wish to exclude the top and bottom 10% from the quintile calculation (so that you are getting equal quintiles of the 10-90th centiles of the original data), or whether you want to do the quintiles first on all the data, then exclude the first and last 10%. Doing it the second way will give you smaller 1st and 5th quintiles, so I assume you mean the first method:

df %>% 
  group_by(Var2) %>% 
  mutate(group = ntile(value, 10)) %>%
  mutate(group = ntile(ifelse(group %% 9 == 1, NA, value), 5))
#> # A tibble: 20 x 4
#> # Groups:   Var2 [2]
#>     Var1 Var2   value group
#>    <int> <fct>  <dbl> <int>
#>  1     1 r1    -0.626     1
#>  2     2 r1     0.184     2
#>  3     3 r1    -0.836    NA
#>  4     4 r1     1.60     NA
#>  5     5 r1     0.330     3
#>  6     6 r1    -0.820     1
#>  7     7 r1     0.487     3
#>  8     8 r1     0.738     5
#>  9     9 r1     0.576     4
#> 10    10 r1    -0.305     2
#> 11     1 r2     8.05     NA
#> 12     2 r2     3.56      2
#> 13     3 r2    -0.485     1
#> 14     4 r2    -6.86     NA
#> 15     5 r2     6.50      5
#> 16     6 r2     1.82      1
#> 17     7 r2     1.94      2
#> 18     8 r2     5.78      4
#> 19     9 r2     5.28      3
#> 20    10 r2     4.38      3

Just in case, the second method you would achieve like this:

df %>% 
  group_by(Var2) %>% 
  mutate(group = ntile(value, 5)) %>%
  mutate(group = ifelse(ntile(value, 10) %% 9 == 1, NA, group))
#> # A tibble: 20 x 4
#> # Groups:   Var2 [2]
#>     Var1 Var2   value group
#>    <int> <fct>  <dbl> <int>
#>  1     1 r1    -0.626     2
#>  2     2 r1     0.184     3
#>  3     3 r1    -0.836    NA
#>  4     4 r1     1.60     NA
#>  5     5 r1     0.330     3
#>  6     6 r1    -0.820     1
#>  7     7 r1     0.487     4
#>  8     8 r1     0.738     5
#>  9     9 r1     0.576     4
#> 10    10 r1    -0.305     2
#> 11     1 r2     8.05     NA
#> 12     2 r2     3.56      3
#> 13     3 r2    -0.485     1
#> 14     4 r2    -6.86     NA
#> 15     5 r2     6.50      5
#> 16     6 r2     1.82      2
#> 17     7 r2     1.94      2
#> 18     8 r2     5.78      4
#> 19     9 r2     5.28      4
#> 20    10 r2     4.38      3

Created on 2022-02-19 by the reprex package (v2.0.1)


Setup and data used

library(dplyr)
library(reshape2)

set.seed(1)

r1 <- rnorm(10,0,1)
r2 <- rnorm(10,2,4)
df <- cbind(r1,r2)
df <- melt(df)
  • Related