I have a dataset with a column, X1, of various values. I would like to order this dataset by the value of X1, and then partition into K number of equal sum subsets. How can this be accomplished in R? I am able to find quartiles for X1 and append the quartile groupings as a new column to the dataset, however, quartile is not quite what I'm looking for. Thank you in advance!
df <- data.frame(replicate(10,sample(0:1000,1000,rep=TRUE)))
df <- within(df, quartile <- as.integer(cut(X1, quantile(X1, probs=0:4/4), include.lowest=TRUE)))
CodePudding user response:
Here's a rough solution (using set.seed(47)
if you want to reproduce exactly). I calculate the proportion of the sum for each row, and do the cumsum
of that proportion, and then cut
that into the desired number of buckets.
library(dplyr)
n_groups = 10
df %>% arrange(X1) %>%
mutate(
prop = X1 / sum(X1),
cprop = cumsum(prop),
bins = cut(cprop, breaks = n_groups - 1)
) %>%
group_by(bins) %>%
summarize(
group_n = n(),
group_sum = sum(X1)
)
# # A tibble: 9 × 3
# bins group_n group_sum
# <fct> <int> <int>
# 1 (-0.001,0.111] 322 54959
# 2 (0.111,0.222] 141 54867
# 3 (0.222,0.333] 111 55186
# 4 (0.333,0.444] 92 55074
# 5 (0.444,0.556] 80 54976
# 6 (0.556,0.667] 71 54574
# 7 (0.667,0.778] 66 55531
# 8 (0.778,0.889] 60 54731
# 9 (0.889,1] 57 55397
This could of course be simplified--you don't need to keep around the extra columns, just mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups - 1))
will add the bins
column to the original data (and no other columns), and the group_by() %>% summarize()
is just to diagnose the result.