I have a vector of a continuous variable. For example:
x <- c(0.9000000, 1.2666667, 4.0000000, 5.7333333, 19.7333333, 35.7666667, 44.0000000, 4.4333333, 0.4666667, 0.7000000, 0.9333333, 1.0000000, 1.0000000, 1.0000000, 1.2000000, 1.2333333, 1.2666667, 1.4333333, 1.7000000, 4.0666667, 1.9000000, 2.1000000, 0.9333333, 1.2666667, 3.7333333, 0.9333333, 2.7666667, 3.1333333, 3.9333333, 5.0333333, 6.0666667, 8.2333333)
I want to bin this vector by width (equal number of values) into three groups (low, medium and high values), so the low group would contain the lowest third of all the values.
Then I want to merge the low and medium bins, so that I end up with one categorical vector containing Not high subjects (the lowest 66%) and High subjects (the highest 33%).
I've checked and I couldn't find any predefined function to do this.
CodePudding user response:
You can use cut() to split and label a numeric vector. Essentially, you are looking for a single break point equal to the value that is 2/3 of the way along the ordered values of x, so you can do:
break_point <- sort(x)[round(2 * length(x)/3)]
break_point
#> [1] 3.733333
Anything above 3.733333 will be 'high'. So we can do:
y <- cut(x, breaks = c(-Inf, break_point, Inf), labels = c('not high', 'high'))
If we put this in a data frame with x, you can see that y appropriately labels the highest values:
data.frame(x, y)
#> x y
#> 1 0.9000000 not high
#> 2 1.2666667 not high
#> 3 4.0000000 high
#> 4 5.7333333 high
#> 5 19.7333333 high
#> 6 35.7666667 high
#> 7 44.0000000 high
#> 8 4.4333333 high
#> 9 0.4666667 not high
#> 10 0.7000000 not high
#> 11 0.9333333 not high
#> 12 1.0000000 not high
#> 13 1.0000000 not high
#> 14 1.0000000 not high
#> 15 1.2000000 not high
#> 16 1.2333333 not high
#> 17 1.2666667 not high
#> 18 1.4333333 not high
#> 19 1.7000000 not high
#> 20 4.0666667 high
#> 21 1.9000000 not high
#> 22 2.1000000 not high
#> 23 0.9333333 not high
#> 24 1.2666667 not high
#> 25 3.7333333 not high
#> 26 0.9333333 not high
#> 27 2.7666667 not high
#> 28 3.1333333 not high
#> 29 3.9333333 high
#> 30 5.0333333 high
#> 31 6.0666667 high
#> 32 8.2333333 high
And you can see that approximately 2/3 of the cases are 'not high' and 1/3 are 'high':
table(y) / length(x)
#> y
#> not high high
#> 0.65625 0.34375
You can't have exactly 2/3 in the 'not high' group because your vector is length 32, which isn't divisible by 3.
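If you also want the intermediate low/medium/high grouping from the question, the same cut() idea extends to two break points. The following is only a sketch: it assumes the tertile boundaries from quantile() with its default settings are acceptable, and the names breaks, three_groups and two_groups are made up for illustration.

```r
x <- c(0.9000000, 1.2666667, 4.0000000, 5.7333333, 19.7333333, 35.7666667,
       44.0000000, 4.4333333, 0.4666667, 0.7000000, 0.9333333, 1.0000000,
       1.0000000, 1.0000000, 1.2000000, 1.2333333, 1.2666667, 1.4333333,
       1.7000000, 4.0666667, 1.9000000, 2.1000000, 0.9333333, 1.2666667,
       3.7333333, 0.9333333, 2.7666667, 3.1333333, 3.9333333, 5.0333333,
       6.0666667, 8.2333333)

# Tertile boundaries: minimum, 1/3 quantile, 2/3 quantile, maximum
breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1), names = FALSE)

# include.lowest = TRUE keeps the minimum value inside the first interval
three_groups <- cut(x, breaks = breaks, include.lowest = TRUE,
                    labels = c("low", "medium", "high"))

# Merge 'low' and 'medium' into a single level; 'high' keeps its own level
two_groups <- three_groups
levels(two_groups) <- list("not high" = c("low", "medium"), "high" = "high")

table(three_groups)
table(two_groups)
```

One caveat with this approach: cut() requires unique breaks, so it will stop with an error if heavy ties in the data make two of the quantiles coincide.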
CodePudding user response:
You can use quantile():
y <- ifelse(x < quantile(x, 2/3), "not high", "high")
proportions(table(y))
# high not high
# 0.34375 0.65625
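Note that ifelse() returns a character vector here, which is why table() lists the groups alphabetically ("high" first). If a fixed level order matters, for example in plots or model output, one option is to wrap the result in factor(). This is just a sketch, and the name y_factor is illustrative.

```r
x <- c(0.9000000, 1.2666667, 4.0000000, 5.7333333, 19.7333333, 35.7666667,
       44.0000000, 4.4333333, 0.4666667, 0.7000000, 0.9333333, 1.0000000,
       1.0000000, 1.0000000, 1.2000000, 1.2333333, 1.2666667, 1.4333333,
       1.7000000, 4.0666667, 1.9000000, 2.1000000, 0.9333333, 1.2666667,
       3.7333333, 0.9333333, 2.7666667, 3.1333333, 3.9333333, 5.0333333,
       6.0666667, 8.2333333)

# Same split as above: values below the 2/3 quantile are 'not high'
y <- ifelse(x < quantile(x, 2/3), "not high", "high")

# Explicit levels put 'not high' first instead of alphabetical order
y_factor <- factor(y, levels = c("not high", "high"))

proportions(table(y_factor))
```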
CodePudding user response:
You can use santoku::chop_equally():
library(santoku)
chopped <- santoku::chop_equally(x, 3, labels = c("low", "medium", "high"))
data.frame(x, chopped)
x chopped
1 0.9000000 low
2 1.2666667 medium
3 4.0000000 high
4 5.7333333 high
5 19.7333333 high
6 35.7666667 high
7 44.0000000 high
8 4.4333333 high
9 0.4666667 low
10 0.7000000 low
...
You can then regroup the factor (if you want to keep the low/medium/high version):
library(forcats)
chopped2 <- forcats::fct_collapse(chopped,
  "High" = "high",
  other_level = "Not high"
)
data.frame(x, chopped2)
x chopped2
1 0.9000000 Not high
2 1.2666667 Not high
3 4.0000000 High
4 5.7333333 High
5 19.7333333 High
6 35.7666667 High
7 44.0000000 High
8 4.4333333 High
9 0.4666667 Not high
10 0.7000000 Not high
...
Alternatively, if you only want the "High"/"Not high" version, use chop_quantiles():
chopped2 <- santoku::chop_quantiles(x, .66,
  labels = c("Not high", "High"))
data.frame(x, chopped2)
x chopped2
1 0.9000000 Not high
2 1.2666667 Not high
3 4.0000000 High
4 5.7333333 High
5 19.7333333 High
6 35.7666667 High
7 44.0000000 High
8 4.4333333 High
9 0.4666667 Not high
10 0.7000000 Not high
...
You say you want to bin by "width (equal number of values)". The above bins by equal number of values, i.e. roughly 1/3 of the values in each of the 3 categories. If you instead want to bin by width, i.e. into equal-width intervals, use santoku::chop_evenly():
chopped3 <- santoku::chop_evenly(x, 3, labels = c("low", "medium", "high"))
data.frame(x, chopped3)
x chopped3
1 0.9000000 low
2 1.2666667 low
3 4.0000000 low
4 5.7333333 low
5 19.7333333 medium
6 35.7666667 high
7 44.0000000 high
8 4.4333333 low
9 0.4666667 low
10 0.7000000 low
...
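For comparison, base R can produce equal-width bins as well: when cut() is given a single number for breaks, it divides the range of x into that many intervals of equal length (and moves the outer limits out slightly so the extremes are included). A minimal sketch, with equal_width as an illustrative name; for this x it is expected to give the same grouping as above, though the handling of values that fall exactly on a boundary may differ between the two functions.

```r
x <- c(0.9000000, 1.2666667, 4.0000000, 5.7333333, 19.7333333, 35.7666667,
       44.0000000, 4.4333333, 0.4666667, 0.7000000, 0.9333333, 1.0000000,
       1.0000000, 1.0000000, 1.2000000, 1.2333333, 1.2666667, 1.4333333,
       1.7000000, 4.0666667, 1.9000000, 2.1000000, 0.9333333, 1.2666667,
       3.7333333, 0.9333333, 2.7666667, 3.1333333, 3.9333333, 5.0333333,
       6.0666667, 8.2333333)

# breaks = 3 asks cut() for three intervals of equal length over range(x)
equal_width <- cut(x, breaks = 3, labels = c("low", "medium", "high"))

table(equal_width)
```

With skewed data like this, equal-width bins leave almost every value in "low", which suggests that equal-count bins are probably what the question is after.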
Note: I am the santoku package maintainer.