How to partition into equal sum subsets in R?-CodePudding

I have a dataset with a column, X1, of various values. I would like to order this dataset by the value of X1, and then partition into K number of equal sum subsets. How can this be accomplished in R? I am able to find quartiles for X1 and append the quartile groupings as a new column to the dataset, however, quartile is not quite what I'm looking for. Thank you in advance!

df <- data.frame(replicate(10,sample(0:1000,1000,rep=TRUE)))
df <- within(df, quartile <- as.integer(cut(X1, quantile(X1, probs=0:4/4), include.lowest=TRUE)))

CodePudding user response：

Here's a rough solution (using set.seed(47) if you want to reproduce exactly). I calculate the proportion of the sum for each row, and do the cumsum of that proportion, and then cut that into the desired number of buckets.

library(dplyr)

n_groups = 10

df %>% arrange(X1) %>%
  mutate(
    prop = X1 / sum(X1),
    cprop = cumsum(prop),
    bins = cut(cprop, breaks = n_groups - 1)
  ) %>%
  group_by(bins) %>% 
  summarize(
    group_n = n(),
    group_sum = sum(X1)
  )
# # A tibble: 9 × 3
#   bins           group_n group_sum
#   <fct>            <int>     <int>
# 1 (-0.001,0.111]     322     54959
# 2 (0.111,0.222]      141     54867
# 3 (0.222,0.333]      111     55186
# 4 (0.333,0.444]       92     55074
# 5 (0.444,0.556]       80     54976
# 6 (0.556,0.667]       71     54574
# 7 (0.667,0.778]       66     55531
# 8 (0.778,0.889]       60     54731
# 9 (0.889,1]           57     55397

This could of course be simplified--you don't need to keep around the extra columns, just mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups - 1)) will add the bins column to the original data (and no other columns), and the group_by() %>% summarize() is just to diagnose the result.