How to divide data into groups in a fast way-CodePudding

I have a large matrix with 12 columns and approximately 1.000.000 rows. Each column represents the money spent by a client in a given month, so with the 12 columns I have information for 1 full year. Each row represents one client.

I need to divide the people into groups based on how much money they spent each month, and I consider the following intervals:

money=0
0<money<=25
25<money<=50
50<money<=75

So for example group1 would be formed by clients that spent 0$ each month for the whole year, group2 would be clients who spent between 0 and 25$ the first month, and 0$ the rest of the months, and so on. In the end I have 12 months, and 4 intervals, so I need to divide data into 4^12=16.777.216 groups (I know this yields to more groups than observations, and that many of the groups will be empty or with very few clients, but that is another problem, so far I am interested in doing this division into groups)

I am currently working in R although I could also switch to Python if required (those are the programming languages I control best), and so far my only idea has been to use nested for loops, one for loop for each month. But this is very, very slow.

So my question is: is there a faster way to do this?

Here I provide a small example with fake data, 10 observations (instead of the 1.000.000), 5 columns (instead of 12) and a simplified version of my current code for doing the grouping.

set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50), nrow=10, ncol=5))

intervals = c(-4, -1, 0, 1, 4)

id_list = c()
group_list = c()

group_idx = 0

for(idx1 in 1:(length(intervals)-1))
{
  data1 = data[(data[, 2] >= intervals[idx1]) & (data[, 2] < intervals[idx1 1]),]
  for(idx2 in 1:(length(intervals))-1)
  {
    data2 = data1[(data1[, 3] >= intervals[idx2]) & (data1[, 3] < intervals[idx2 1]),]
    for(idx3 in 1:(length(intervals)-1))
    {
      data3 = data2[(data2[, 4] >= intervals[idx3]) & (data2[, 4] < intervals[idx3 1]),]
      for(idx4 in 1:(length(intervals)-1))
      {
        data4 = data3[(data3[, 5] >= intervals[idx4]) & (data3[, 5] < intervals[idx4 1]),]
        for(idx5 in 1:(length(intervals)-1))
        {
          data5 = data4[(data4[, 6] >= intervals[idx5]) & (data4[, 6] < intervals[idx5 1]),]
          group_idx = group_idx   1
          id_list = c(id_list, data5$id)
          group_list = c(group_list, rep(group_idx, nrow(data5)))
        }
      }
    }
  }
}

CodePudding user response：

If you do need to do this--which I certainly have my doubts about--I would suggest creating a matrix with the classification for each cell of the original data, and then pasting them together to make a group label.

Doing this we can set the group labels to be human readable, which might be nice.

I would recommend simply adding this grouping column to the original data and then using dplyr or data.table to do grouped operations for your next steps, but if you really want separate data frames for each you can then split the original data based on these group labels.

## I redid your sample data to put it on the same general scale as 
## your actual data
set.seed(5)
data = data.frame(id=1:10, matrix(rnorm(50, mean = 50, sd = 20), nrow=10, ncol=5))

my_breaks = c(0, 25 * 1:3, Inf)
## you could use default labels, but this seems nicer
my_labs = c("Low", "Med", "High", "Extreme")

## classify each value from the data
grouping = vapply(
  data[-1], \(x) as.character(cut(x, breaks = my_breaks)),
  FUN.VALUE = character(nrow(data))
)

## create labels for the groups
group_labels = apply(grouping, 2, \(x) paste(1:(ncol(data) - 1), x, sep = ":", collapse = " | "))

## either add the grouping value to the original data or split the data based on groups
data$group = group_labels
result = split(data, group_labels)

result
# $`1:(25,50] | 2:(75,Inf] | 3:(0,25] | 4:(50,75] | 5:(75,Inf] | 1:(25,50] | 2:(25,50] | 3:(25,50] | 4:(25,50] | 5:(50,75]`
#   id       X1       X2       X3       X4       X5
# 1  1 33.18289 74.55261 68.01024 56.31830 81.00121
# 6  6 37.94184 47.22028 44.13036 69.03148 61.24447
# 
# $`1:(50,75] | 2:(25,50] | 3:(25,50] | 4:(25,50] | 5:(25,50] | 1:(25,50] | 2:(25,50] | 3:(0,25] | 4:(50,75] | 5:(25,50]`
#   id       X1       X2       X3       X4       X5
# 2  2 77.68719 33.96441 68.83739 72.19388 33.95154
# 7  7 40.55667 38.05374 78.37178 29.80935 32.25983
# 
# $`1:(50,75] | 2:(50,75] | 3:(75,Inf] | 4:(50,75] | 5:(50,75] | 1:(25,50] | 2:(75,Inf] | 3:(75,Inf] | 4:(25,50] | 5:(25,50]`
#   id       X1        X2       X3        X4       X5
# 3  3 24.89016 28.392148 79.35924 94.309211 48.50842
# 8  8 37.29257  6.320665 79.97548  9.990545 40.79511
# 
# $`1:(50,75] | 2:(50,75] | 3:(75,Inf] | 4:(50,75] | 5:(75,Inf] | 1:(50,75] | 2:(25,50] | 3:(0,25] | 4:(0,25] | 5:(25,50]`
#   id       X1       X2       X3       X4       X5
# 4  4 51.40286 46.84931 64.13522 74.34207 87.91336
# 9  9 44.28453 54.81635 36.85836 14.75628 35.51343
# 
# $`1:(75,Inf] | 2:(25,50] | 3:(25,50] | 4:(75,Inf] | 5:(25,50] | 1:(50,75] | 2:(25,50] | 3:(25,50] | 4:(25,50] | 5:(25,50]`
#    id       X1       X2       X3       X4       X5
# 5   5 84.22882 28.56480 66.38018 79.58444 40.86862
# 10 10 52.76216 44.81289 32.94409 47.14784 48.61578

CodePudding user response：

Using findInterval, a group ID can be added in a fraction of a second on a 1M row table:

library(data.table)
set.seed(538924142)
data <- data.frame(id = 1:1e6, matrix(runif(12e6, 0, 75)*sample(0:1, 12e6, TRUE, c(0.25, 0.75)), 1e6, 12))
system.time({
  setDT(data)[
    , grp := colSums(
      matrix(
        findInterval(
          t(as.matrix(.SD)),
          c(0, 25, 50, 75),
          left.open = TRUE
        ),
        12, 1e6
      )*4^(0:11)
    ),
    .SDcols = 2:13
  ]
})
#>    user  system elapsed 
#>    0.26    0.05    0.31
head(data)
#>    id       X1        X2        X3       X4       X5         X6        X7          X8       X9       X10      X11      X12      grp
#> 1:  1  0.00000 67.680617 26.178075 65.66532  0.00000 55.2356394  5.438976 72.20526839 70.47368  0.000000  0.00000 29.17494  8641772
#> 2:  2  0.00000  8.193552 10.482581 19.15885 30.28639 44.3917749  1.876230 11.19145219 55.22776 48.725632 17.18597 74.58265 14375508
#> 3:  3  0.00000 63.301921  0.000000 61.50508  0.00000  0.5755531 52.139676 51.46551228 58.90514 60.098006 12.90056  0.00000  2094284
#> 4:  4 18.06334 34.970526  9.599701 38.64339 57.00753 62.3455201 30.377876 73.73237960  0.00000 18.706219  0.00000 25.57064  8712089
#> 5:  5 27.49489  8.770596  0.000000 67.30562 58.43427 26.2856874 65.784429 36.96939287 54.65132  3.676736 29.51849 25.35926 10992582
#> 6:  6 59.27949 14.830172  2.233060 13.27291 16.63301  2.5727847  0.000000  0.05254523 23.44611 29.529823  0.00000 63.00820 13190487
data[which.min(data$grp)]
#>        id       X1 X2      X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 grp
#> 1: 189293 4.801804  0 26.7038  0  0  0  0  0  0   0   0   0  33
data[which.max(data$grp)]
#>        id X1 X2       X3       X4       X5       X6       X7       X8       X9      X10     X11      X12      grp
#> 1: 637400  0  0 69.10316 56.61781 52.88433 62.50076 72.81748 57.27957 70.34022 72.01065 53.4228 56.72517 16777200

Then proceed with data.table subsetting and grouping operations. If you really want it split:

group_list <- split(data, by = "grp")

But,

Be aware that processing list of data.tables will be generally much slower than manipulation in single data.table by group using by argument