Home > database >  Perform row-wise operation in datatable with multiple elements
Perform row-wise operation in datatable with multiple elements

Time:05-07

I have the following data table:

library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))

Now the problem I have is that I need to take every value in column "x" (that is, each given ID), and add all the existing values in column "y" to it. I also need to do it separately per column "group". Let's assume we start with ID = 1. This element has the value: x_1 = 68, and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 y_1, x_1 y2, x_1 y_3, etc. But not only for x_1, but also for x_2, x_3, etc. So for x_2 it would look like: x_2 y_1, x_2 y_2, x_2 y_3, etc. This should also be done separately per column "group" (in this regard the dataset should simple be split by group).

Edit: Exemplary code to do this only for X_1 and group A:

current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X   y]
DT

One apparent issue with this approach is that if it were applied to all x, then a lot of columns would be added to the final DT. So I am not sure if it is the best approach. In the end, I am just looking for the lowest sum (per element x) with each element y, and per group.

I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.

Thanks for any hints how to approach this.

CodePudding user response:

You can do this:

DT[, .(.BY$x DT[group==.BY$group,y]), by=.(x,group)]

This returns N rows per x, where N is the size of x's group. We leverage the special (.BY), which is available in j when utilizing by. Basically, .BY is a named list, containing the values of the grouping variables. Here, I'm adding the value of x (.BY$x) to the vector of y values from the subset of DT where the group is equal to the current group value (.BY$group)

Output:

          x  group    V1
      <int> <char> <int>
   1:    68      A   144
   2:    68      A   107
   3:    68      A    92
   4:    68      A   121
   5:    68      A   160
  ---                   
4996:     4      B    25
4997:     4      B    66
4998:     4      B    83
4999:     4      B    27
5000:     4      B    68

You can also accomplish this via a join:

DT[,!c("y")][DT[, .(y,group)], on=.(group), allow.cartesian=T][, total:=x y][order(ind)]

Output:

        ind     x  group     y total
      <int> <int> <char> <int> <int>
   1:     1    68      A    76   144
   2:     1    68      A    39   107
   3:     1    68      A    24    92
   4:     1    68      A    53   121
   5:     1    68      A    92   160
  ---                               
4996:   100     4      B    21    25
4997:   100     4      B    62    66
4998:   100     4      B    79    83
4999:   100     4      B    23    27
5000:   100     4      B    64    68
  • Related