Why does the quantile function in R give unequal counts of values in each group

Time:08-19

I am trying to group continuous data into tertiles, and I am using the quantile function to do this. The following is my code:

dd$wbc_tert = with(dd, 
               cut(wbc, 
                   vTert, 
                   include.lowest = TRUE, 
                   labels = c("Low", "Medium", "High")))

Isn't it supposed to give an equal count of values in each group? I am getting a different count in each group.

> dd %>% filter(wbc_tert == 'High') %>% select('wbc')  %>% nrow() 
[1] 143
> dd %>% filter(wbc_tert == 'Low') %>% select('wbc')  %>% nrow()
[1] 148
> dd %>% filter(wbc_tert == 'Medium') %>% select('wbc')  %>% nrow()
[1] 139

This is the dput of the values

c(10.9, 5.4, 9.1, 7.4, 6.6, 5.5, 4.4, 6.7, 7.8, 6.7, 6.6, 8.6, 
8.4, 4.8, 7, 5.2, 7, 6.7, 10.4, 7.5, 8.5, 6.8, 8.5, 9.4, 4.6, 
6.8, 10.2, 6.7, 4.6, 4.9, 6.7, 8.9, 5.9, 5.9, 9.9, 4.1, 8.4, 
9, 7.7, 8.2, 5.7, 8.4, 7.7, 4.6, 6.5, 7.3, 4.9, 3.8, 6.2, 7.9, 
5.3, 8.9, 6, 4.8, 5.9, 5.4, 8.6, 6.1, 9.5, 5.8, 6.2, 5.8, 7.9, 
9.6, 6.6, 9.6, 7, 10.1, 9, 6.9, 9.1, 6.8, 8.4, 9.6, 4.4, 10.5, 
7.9, 5.6, 5.1, 6.6, 6.5, 12.7, 5.3, 7.7, 4.8, 4.7, 6.1, 4.3, 
6.1, 11.6, 5.9, 7.4, 5.7, 4.7, 4.8, 8.5, 5.6, 7.9, 9.1, 7.8, 
5.3, 5, 8.1, 8.3, 4.7, 5.4, 7.6, 7.2, 5.7, 7.9, 7.9, 6.4, 3.8, 
4.7, 6.2, 5, 7.6, 5.8, 5.4, 4.3, 6, 4.7, 6, 6.1, 5.8, 5.6, 4.7, 
5, 11.5, 6.3, 4.4, 6.8, 6.6, 6.8, 6.1, 4.8, 5.4, 5.8, 5.2, 7.1, 
5.4, 9.1, 6.9, 5.4, 8.5, 5.3, 7.3, 6.9, 9, 6.3, 8.4, 7.8, 5.7, 
6.4, 5.3, 9.6, 6.4, 9.9, 8.9, 7.7, 6.2, 7.2, 4.6, 5.4, 4.6, 11.2, 
3.1, 12.3, 5.9, 11.1, 6.2, 6.6, 4.1, 7.4, 9.4, 4.1, 6.7, 6.7, 
6.1, 6.3, 5.6, NA, 3.7, 6.8, 6.7, 6.4, 7.3, 5.7, 6.7, 6.9, 5.7, 
5.3, 4, 5.6, 4.8, 5.5, 6, 6.6, 3.6, 5.6, 8.9, 6.3, 5.8, 8.2, 
8.6, 8.5, 5.7, 8.6, 6, 5.1, 5.7, 8.2, 5.4, 6.9, 6.9, 8.3, 9.5, 
5.4, 10.2, 8.8, 7.2, 4.8, 9.8, 4.6, 6.3, 5.8, 4.9, 12.7, 7.5, 
10.6, 9.3, 5.5, 10.7, 6.2, 9.3, 8.3, 7.8, 8.05, 9.57, 6.62, 6.21, 
5.34, 6.11, 10.37, 4.45, 5.55, 8.05, 8.31, 5.06, 6.05, 4.76, 
9.09, 9.11, 9.04, 6.99, 6.33, 9.47, 6.48, 4.46, 9.44, 6.88, 7.09, 
5.75, 10.89, 6.68, 3.64, 6.55, 8.69, 5.89, 9.05, 6.38, 11.62, 
9.11, 9.22, 7.97, 9.64, 12.76, 8.39, 6.57, 8.1, 7.3, 10.1, 4.7, 
6.4, 7.2, 5.5, 3.7, 5.1, 9.8, 7.6, 7.7, 6, 3.9, 6.8, 5.4, 5.4, 
9.7, 9, 6, 7.3, 6.3, 5.8, 8.3, 7, 4.1, 11.2, 5, 7.6, 6.5, 4.8, 
8, 10.1, 7.1, 7.4, 4.3, 4, 10.12, 4.3, 7.26, 8.84, 8.44, 8.44, 
8.12, 6.5, 8.58, 8.55, 8.82, 4.53, 9.51, 4.93, 4.42, 4.69, 8.69, 
5.77, 3.37, 6.58, 3.72, 3.09, 7.13, 8.11, 7.2, 12.18, 6.52, 7.91, 
5.69, 8.24, 7.67, 5.69, 4.85, 7.03, 4.16, 3.57, 8.1, 4.61, 5.98, 
5.13, 7.68, 5.47, 5.54, 4.59, 6, 11.62, 7.38, 7.06, 8.74, 8.02, 
6.73, 7.19, 6.36, 4.86, 6.55, 8.4, 7.76, 4.73, 4.8, 5.73, 8.53, 
4.6, 7.96, 9.48, 6.59, 5.75, 6.61, 6.49, 7.91, 6.92, 7.14, 6.24, 
12.53, 7.03, 4.73, 8.05, 7.26, 4.07, 6.7, 5.7, 7.39, 5.2, 6.61, 
6.8, 6.77, 5.65, 6.08, 7.24, 6.13, 7.92, 7.37, 7.99, 3.31, 9.72, 
8.71, 8.35, 5.05, 8.15, 5.1, 5.4, 8.8, 4.9, 5, 7.43, 10.3, 6.3, 
9.5, 6.9, 6.7, 5.4, 7.7, 8, 6.5, 5.6, 9.7)

Can someone please explain what the reason could be?

CodePudding user response:

Here is an example showing that quantile cut points do not necessarily split the data into groups of equal size. When many values are tied, they all fall on the same side of a cut point.

# Define some data
x <- 1:10
y <-  rep(1:2, 10)
# Look at the quantiles
quantile(x)
#>    0%   25%   50%   75%  100% 
#>  1.00  3.25  5.50  7.75 10.00
# With the tied values in y added, the group sizes become asymmetric
quantile(c(y,x))
#>    0%   25%   50%   75%  100% 
#>  1.00  1.00  2.00  2.75 10.00
# Notice how the number of values below the 50% and 75% quantiles changes.
## Without y, the bins are roughly the same size
sum(x<quantile(x, .5))
#> [1] 5
sum(x<quantile(x, .75))
#> [1] 7
## But once y is added, the count doubles (from 11 to 22) even though the
## percentile only increases by 25 points
sum(c(y,x)<quantile(c(y,x), .5))
#> [1] 11
sum(c(y,x)<quantile(c(y,x), .75))
#> [1] 22

Created on 2022-08-18 by the reprex package (v2.0.1)
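
To make the effect of ties on tertiles concrete, here is a minimal sketch on made-up data. The vector x and its values are an assumption for illustration, not the asker's data. It counts how many observations are tied with each cut point, which is what stops cut from producing equal groups.

```r
# Hypothetical toy data: 9 values, with the value 2 repeated five times
x <- c(1, 2, 2, 2, 2, 2, 3, 4, 5)

# Tertile cut points, using the default type = 7
cuts <- quantile(x, (0:3) / 3)

# Bin with right-closed intervals, which is the default behaviour of cut()
grp <- cut(x, cuts, include.lowest = TRUE,
           labels = c("Low", "Medium", "High"))

# Group sizes: all five tied 2s land in "Low", so the groups cannot be 3/3/3
sizes <- table(grp)
print(sizes)

# How many observations are exactly equal to each cut point
ties_at_cuts <- sapply(cuts, function(q) sum(x == q))
print(ties_at_cuts)
```

Running the same tie count on the real column (for example sum(dd$wbc == vTert[2], na.rm = TRUE)) should show whether ties at the cut points account for the uneven counts.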

CodePudding user response:

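The quantile function can compute quantiles in 9 different ways (argument type); the default is type = 7.
Use findInterval instead of cut. By default cut uses right-closed intervals (a, b], whereas findInterval uses left-closed intervals [a, b), so values tied with a cut point are assigned to the upper group instead of the lower one.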

vTert <- quantile(dd$wbc, (0:3)/3, na.rm = TRUE)

dd$wbc_tert <- findInterval(dd$wbc, vTert, rightmost.closed = TRUE, all.inside = TRUE)
dd$wbc_tert <- factor(dd$wbc_tert, labels = c("Low", "Medium", "High"))

table(dd$wbc_tert, useNA = "always")
#> 
#>    Low Medium   High   <NA> 
#>    143    143    144      1

Created on 2022-08-18 by the reprex package (v2.0.1)
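
The difference between the two functions comes down to which side of each interval is closed. The sketch below, on made-up data (the vector x is an assumption for illustration), bins the same values with cut, with findInterval, and with cut(right = FALSE), which should match findInterval.

```r
# Hypothetical toy data with three tied values at the first tertile
x <- c(1, 2, 3, 3, 3, 4, 5, 6, 7)
cuts <- quantile(x, (0:3) / 3)
labs <- c("Low", "Medium", "High")

# cut(): right-closed intervals (a, b], so the tied 3s go to "Low"
by_cut <- cut(x, cuts, include.lowest = TRUE, labels = labs)

# findInterval(): left-closed intervals [a, b), so the tied 3s go to "Medium"
by_fi <- factor(findInterval(x, cuts, rightmost.closed = TRUE,
                             all.inside = TRUE),
                levels = 1:3, labels = labs)

# cut() with right = FALSE uses the same left-closed convention
by_cut_left <- cut(x, cuts, right = FALSE, include.lowest = TRUE,
                   labels = labs)

print(table(by_cut))       # Low 5, Medium 1, High 3
print(table(by_fi))        # Low 2, Medium 4, High 3
print(table(by_cut_left))  # same counts as findInterval
```

Neither convention gives 3/3/3 here: changing which side is closed only moves the tied values from one group to its neighbour, so whether the counts come out more balanced depends on where the ties sit in the data.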
