Perform row-wise operation in datatable with multiple elements-CodePudding

I have the following data table:

library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))

Now the problem I have is that I need to take every value in column "x" (that is, each given ID), and add all the existing values in column "y" to it. I also need to do it separately per column "group". Let's assume we start with ID = 1. This element has the value: x_1 = 68, and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 y_1, x_1 y2, x_1 y_3, etc. But not only for x_1, but also for x_2, x_3, etc. So for x_2 it would look like: x_2 y_1, x_2 y_2, x_2 y_3, etc. This should also be done separately per column "group" (in this regard the dataset should simple be split by group).

Edit: Exemplary code to do this only for X_1 and group A:

current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X   y]
DT

One apparent issue with this approach is that if it were applied to all x, then a lot of columns would be added to the final DT. So I am not sure if it is the best approach. In the end, I am just looking for the lowest sum (per element x) with each element y, and per group.

I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.

Thanks for any hints how to approach this.

CodePudding user response：

You can do this:

DT[, .(.BY$x DT[group==.BY$group,y]), by=.(x,group)]

This returns N rows per x, where N is the size of x's group. We leverage the special (.BY), which is available in j when utilizing by. Basically, .BY is a named list, containing the values of the grouping variables. Here, I'm adding the value of x (.BY$x) to the vector of y values from the subset of DT where the group is equal to the current group value (.BY$group)

Output:

          x  group    V1
      <int> <char> <int>
   1:    68      A   144
   2:    68      A   107
   3:    68      A    92
   4:    68      A   121
   5:    68      A   160
  ---                   
4996:     4      B    25
4997:     4      B    66
4998:     4      B    83
4999:     4      B    27
5000:     4      B    68

You can also accomplish this via a join:

DT[,!c("y")][DT[, .(y,group)], on=.(group), allow.cartesian=T][, total:=x y][order(ind)]

Output:

        ind     x  group     y total
      <int> <int> <char> <int> <int>
   1:     1    68      A    76   144
   2:     1    68      A    39   107
   3:     1    68      A    24    92
   4:     1    68      A    53   121
   5:     1    68      A    92   160
  ---                               
4996:   100     4      B    21    25
4997:   100     4      B    62    66
4998:   100     4      B    79    83
4999:   100     4      B    23    27
5000:   100     4      B    64    68