Bin a vector by width

I have a vector of a continuous variable. For example:

x <- c(0.9000000,  1.2666667,  4.0000000,  5.7333333, 19.7333333, 35.7666667, 44.0000000,  4.4333333,  0.4666667,  0.7000000,  0.9333333,  1.0000000,  1.0000000,  1.0000000,  1.2000000,  1.2333333, 1.2666667,  1.4333333,  1.7000000,  4.0666667,  1.9000000,  2.1000000,  0.9333333,  1.2666667,  3.7333333,  0.9333333,  2.7666667,  3.1333333,  3.9333333,  5.0333333,  6.0666667,  8.2333333)

I want to bin this vector by width (equal number of values) into three groups (low, medium and high values), so that the low group contains the lowest third of all the values.

Then I want to merge the low and medium bins, so that I end up with one categorical vector with two levels: "Not high" (the lowest 66% of subjects) and "High" (the highest 33%).

I've checked and I couldn't find any predefined function to do this.

CodePudding user response：

You can use cut to split and label a numeric vector. Essentially, you are looking for a single break point equal to the value that is 2/3 of the way along the ordered values of x, so you can do:

break_point <- sort(x)[round(2 * length(x)/3)]

break_point
#> [1] 3.733333

Anything above 3.733333 will be 'high'. So we can do:

y <- cut(x, breaks = c(-Inf, break_point, Inf), labels = c('not high', 'high'))

If we put this in a data frame with x, you can see y appropriately labels the highest values:

data.frame(x, y)
#>             x        y
#> 1   0.9000000 not high
#> 2   1.2666667 not high
#> 3   4.0000000     high
#> 4   5.7333333     high
#> 5  19.7333333     high
#> 6  35.7666667     high
#> 7  44.0000000     high
#> 8   4.4333333     high
#> 9   0.4666667 not high
#> 10  0.7000000 not high
#> 11  0.9333333 not high
#> 12  1.0000000 not high
#> 13  1.0000000 not high
#> 14  1.0000000 not high
#> 15  1.2000000 not high
#> 16  1.2333333 not high
#> 17  1.2666667 not high
#> 18  1.4333333 not high
#> 19  1.7000000 not high
#> 20  4.0666667     high
#> 21  1.9000000 not high
#> 22  2.1000000 not high
#> 23  0.9333333 not high
#> 24  1.2666667 not high
#> 25  3.7333333 not high
#> 26  0.9333333 not high
#> 27  2.7666667 not high
#> 28  3.1333333 not high
#> 29  3.9333333     high
#> 30  5.0333333     high
#> 31  6.0666667     high
#> 32  8.2333333     high

And you can see that approximately 2/3 of the cases are 'not high' and 1/3 are 'high':

table(y) / length(x)
#> y
#> not high     high 
#>  0.65625  0.34375

You can't have exactly 2/3 in the 'not high' group because your vector is length 32, which isn't divisible by 3.
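
The arithmetic behind that split can be checked directly. This is a minimal base R sketch that reuses the x from the question; the names n, idx, n_not_high and share are only illustrative:

```r
x <- c(0.9000000, 1.2666667, 4.0000000, 5.7333333, 19.7333333, 35.7666667,
       44.0000000, 4.4333333, 0.4666667, 0.7000000, 0.9333333, 1.0000000,
       1.0000000, 1.0000000, 1.2000000, 1.2333333, 1.2666667, 1.4333333,
       1.7000000, 4.0666667, 1.9000000, 2.1000000, 0.9333333, 1.2666667,
       3.7333333, 0.9333333, 2.7666667, 3.1333333, 3.9333333, 5.0333333,
       6.0666667, 8.2333333)

n   <- length(x)           # 32 values
idx <- round(2 * n / 3)    # 2 * 32 / 3 = 21.33, which rounds to 21
break_point <- sort(x)[idx]

# cut() builds right-closed intervals by default, so a value equal to
# break_point lands in the lower ('not high') group.
n_not_high <- sum(x <= break_point)
share      <- n_not_high / n

n_not_high
#> [1] 21
share
#> [1] 0.65625
```

Using ceiling() instead of round() would put 22 values in the lower group (22/32 = 0.6875), so either way the split can only approximate 2/3.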

CodePudding user response：

You can use quantile():

y <- ifelse(x < quantile(x, 2/3), "not high", "high")

proportions(table(y))

#     high not high 
#  0.34375  0.65625
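
If the intermediate low/medium/high grouping from the question is also needed, the same quantile() idea can feed cut(). This is only a sketch in base R, assuming the default quantile type; the names breaks, groups and binary are illustrative:

```r
x <- c(0.9000000, 1.2666667, 4.0000000, 5.7333333, 19.7333333, 35.7666667,
       44.0000000, 4.4333333, 0.4666667, 0.7000000, 0.9333333, 1.0000000,
       1.0000000, 1.0000000, 1.2000000, 1.2333333, 1.2666667, 1.4333333,
       1.7000000, 4.0666667, 1.9000000, 2.1000000, 0.9333333, 1.2666667,
       3.7333333, 0.9333333, 2.7666667, 3.1333333, 3.9333333, 5.0333333,
       6.0666667, 8.2333333)

# Break points at the minimum, the 1/3 and 2/3 quantiles, and the maximum
breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum value inside the first bin
groups <- cut(x, breaks = breaks,
              labels = c("low", "medium", "high"),
              include.lowest = TRUE)

# Collapse low and medium into a single 'not high' level
binary <- ifelse(groups == "high", "high", "not high")

table(groups)
proportions(table(binary))
```

One caveat: if the data contain many tied values, two quantiles can coincide, and cut() then stops with an error because the breaks are not unique.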

CodePudding user response：

You can use santoku::chop_equally():

library(santoku)
chopped <- santoku::chop_equally(x, 3, labels = c("low", "medium", "high"))
data.frame(x, chopped)
            x chopped
1   0.9000000     low
2   1.2666667  medium
3   4.0000000    high
4   5.7333333    high
5  19.7333333    high
6  35.7666667    high
7  44.0000000    high
8   4.4333333    high
9   0.4666667     low
10  0.7000000     low
...

You can then regroup the factor (if you want to keep the low/medium/high version):

library(forcats)
chopped2 <- forcats::fct_collapse(chopped,
                                  "High" = "high",
                                  other_level = "Not high"
                                  )
data.frame(x, chopped2)
            x chopped2
1   0.9000000 Not high
2   1.2666667 Not high
3   4.0000000     High
4   5.7333333     High
5  19.7333333     High
6  35.7666667     High
7  44.0000000     High
8   4.4333333     High
9   0.4666667 Not high
10  0.7000000 Not high
...

Alternatively, if you only want the "High"/"Not high" version, use chop_quantiles():

chopped2 <- santoku::chop_quantiles(x, .66, 
                                    labels = c("Not high", "High"))
data.frame(x, chopped2)
            x chopped2
1   0.9000000 Not high
2   1.2666667 Not high
3   4.0000000     High
4   5.7333333     High
5  19.7333333     High
6  35.7666667     High
7  44.0000000     High
8   4.4333333     High
9   0.4666667 Not high
10  0.7000000 Not high
...

You say you want to bin by "width (equal number of values)". The above bins by equal number of values, i.e. 1/3 in each of the 3 categories. If you want to bin by width, i.e. into equal-width intervals, use santoku::chop_evenly():

chopped3 <- santoku::chop_evenly(x, 3, labels = c("low", "medium", "high"))
data.frame(x, chopped3)
            x chopped3
1   0.9000000      low
2   1.2666667      low
3   4.0000000      low
4   5.7333333      low
5  19.7333333   medium
6  35.7666667     high
7  44.0000000     high
8   4.4333333      low
9   0.4666667      low
10  0.7000000      low
...
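
For comparison, equal-width binning can also be sketched in base R without extra packages: when breaks is a single number, cut() divides the range of x into that many intervals of equal width. The name eq_width is illustrative. cut() nudges the outermost limits slightly outwards so that the extremes are included, so values sitting exactly on a boundary might be treated differently than in chop_evenly():

```r
x <- c(0.9000000, 1.2666667, 4.0000000, 5.7333333, 19.7333333, 35.7666667,
       44.0000000, 4.4333333, 0.4666667, 0.7000000, 0.9333333, 1.0000000,
       1.0000000, 1.0000000, 1.2000000, 1.2333333, 1.2666667, 1.4333333,
       1.7000000, 4.0666667, 1.9000000, 2.1000000, 0.9333333, 1.2666667,
       3.7333333, 0.9333333, 2.7666667, 3.1333333, 3.9333333, 5.0333333,
       6.0666667, 8.2333333)

# breaks = 3 asks for three intervals of equal width across range(x)
eq_width <- cut(x, breaks = 3, labels = c("low", "medium", "high"))

# Because the data are right-skewed, most values end up in 'low'
table(eq_width)
```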

Note: I am the santoku package maintainer.