R: Iterating over a data.table, with replacements-CodePudding

I've been racking my head over this over the past few days. I have a dataset that appears as follows:

V1 <- c("A", "B", "C", "D", "B", "A", "A", "D")
V2 <- c("B", "E", "A", "G", "C", "G", "E", "B")

R1 <- c(120, 195, 135, 30, 195, 120, 120, 30)
G1 <- c(0, 195, 0, 195, 195, 0, 0, 195)
B1 <- c(240, 195, 0, 135, 195, 240, 240, 135)

R2 <- c(195, 60, 120, 75, 135, 75, 60, 195)
G2 <- c(195, 15, 0, 15, 0, 15, 15, 195)
B2 <- c(195, 150, 240, 150, 0, 150, 150, 195)

cross <- data.frame(V1,V2,R1,G1,B1,R2,G2,B2)

This is a grossly simplified version of my actual dataset, which is a lot larger with over 60,000 observations. In the first two columns, V1 and V2 denote ID variables. Then R1, B1 and G1 correspond to the attributes associated with V1 and, likewise, R2, B2 and G2 are the attributes of V2 (this is a network model, mapping which nodes are linked to which nodes and the associated attributes of those nodes, but that's beside the point here). This gives a data.frame that looks like:

  V1 V2  R1  G1  B1  R2  G2  B2
1  A  B 120   0 240 195 195 195
2  B  E 195 195 195  60  15 150
3  C  A 135   0   0 120   0 240
4  D  G  30 195 135  75  15 150
5  B  C 195 195 195 135   0   0
6  A  G 120   0 240  75  15 150
7  A  E 120   0 240  60  15 150
8  D  B  30 195 135 195 195 195

Note that IDs that appear in V1 can also appear in V2. Also note that the associated attributes for each ID is unique and uniform across the dataset.

Now, what I want to do is iterate through each row and calculate a series of means. After the first iteration of this process, my data should look like:

  V1 V2    R1     G1      B1     R2    G2      B2
1  A  B 157.5   97.5   217.5  157.5  97.5   217.5
2  B  E   195    195     195     60    15     150
3  C  A   135      0       0    120     0     240
4  D  G    30    195     135     75    15     150
5  B  C   195    195     195    135     0       0
6  A  G   120      0     240     75    15     150
7  A  E   120      0     240     60    15     150
8  D  B    30    195     135    195   195     195

i.e., R1 and R2 is the mean of its prior two respective values for the first row, 120 and 195. And so on.

Then, that mean is reinserted as the new attributes (matched to ID), and the process starts again with the next row. After this step, my dataset should look like:

  V1 V2    R1     G1      B1     R2    G2      B2
1  A  B 157.5   97.5   217.5  157.5  97.5   217.5
2  B  E 157.5   97.5   217.5     60    15     150
3  C  A   135      0       0  157.5  97.5   217.5
4  D  G    30    195     135     75    15     150
5  B  C 157.5   97.5   217.5    135     0       0
6  A  G 157.5   97.5   217.5     75    15     150
7  A  E 157.5   97.5   217.5     60    15     150
8  D  B    30    195     135  157.5  97.5   217.5

Thus, all attributes for A and B have changed. The process would then continue for the second row (for B and E) and so on. The process continues until the last row is reached.

Here is my code so far. I have not used data.table as that's what I'm trying to figure out. But here it is. It works, but it's extremely slow, making it difficult for me to explore what's going on in the system.

for(i in 1:nrow(cross)){
  Rc <- (cross[i,3]   cross[i,6]) / 2 
  Gc <- (cross[i,4]   cross[i,7]) / 2 
  Bc <- (cross[i,5]   cross[i,8]) / 2 
  V1c <- cross[i,"V1"]
  V2c <- cross[i,"V2"]
  
  cross$R1 <- with(cross, replace(R1, V1 == V1c, Rc)) 
  cross$G1 <- with(cross, replace(G1, V1 == V1c, Gc))  
  cross$B1 <- with(cross, replace(B1, V1 == V1c, Bc)) 
  
  cross$R1 <- with(cross, replace(R1, V1 == V2c, Rc)) 
  cross$G1 <- with(cross, replace(G1, V1 == V2c, Gc))  
  cross$B1 <- with(cross, replace(B1, V1 == V2c, Bc)) 
  
  cross$R2 <- with(cross, replace(R2, V2 == V2c, Rc)) 
  cross$G2 <- with(cross, replace(G2, V2 == V2c, Gc))  
  cross$B2 <- with(cross, replace(B2, V2 == V2c, Bc)) 
  
  cross$R2 <- with(cross, replace(R2, V2 == V1c, Rc)) 
  cross$G2 <- with(cross, replace(G2, V2 == V1c, Gc))  
  cross$B2 <- with(cross, replace(B2, V2 == V1c, Bc)) 
}

This process takes longer than one hour given the size of my data. From what I gather, data.table is supposed to be much faster. I've tried almost everything, ranging from tidyverse stuff to converting into matrices. I've even subsetted the data that needs to be replaced. But I'm having quite a lot of difficulty trying to figure out using data.table for this, which apparently would skip the need to use a for loop.

If it's any help, the slow part of the loop seems to be the section where variables get replaced, not when they are generated.

Thanks in advance!

CodePudding user response：

I'm a big fan of data.table, but I don't think it's needed here. Instead of updating all matching entries in the whole data.frame on each iteration, just update a reference matrix by index.

V1 <- c("A", "B", "C", "D", "B", "A", "A", "D")
V2 <- c("B", "E", "A", "G", "C", "G", "E", "B")

R1 <- c(120, 195, 135, 30, 195, 120, 120, 30)
G1 <- c(0, 195, 0, 195, 195, 0, 0, 195)
B1 <- c(240, 195, 0, 135, 195, 240, 240, 135)

R2 <- c(195, 60, 120, 75, 135, 75, 60, 195)
G2 <- c(195, 15, 0, 15, 0, 15, 15, 195)
B2 <- c(195, 150, 240, 150, 0, 150, 150, 195)

V12 <- c(V1, V2)
uids <- unique(V12)
idx1 <- match(V1, uids)
idx2 <- match(V2, uids)
mRef <- matrix(c(R1, R2, G1, G2, B1, B2), ncol = 3)[match(uids, V12),]

mRef is a matrix of R, G, B values for the unique IDs in V1 and V2. The idx vectors point to the rows of mRef corresponding to the IDs in V1 and V2.

Here's a quick little Rcpp function to iterate through idx1 and idx2 to update mRef.

Rcpp::cppFunction('NumericMatrix updatecross(const IntegerVector& id1, const IntegerVector& id2, NumericMatrix attr) {
  const int idrows = id1.length();
  const int attrcols = attr.ncol();
  double newval = 0;

  for (int col = 0; col < attrcols; col  ) {
    for (int row = 0; row < idrows; row  ) {
      newval = (attr(id1(row), col)   attr(id2(row), col))/2;
      attr(id1(row), col) = newval;
      attr(id2(row), col) = newval;
    }
  }

  return attr;
}')

Update mRef and build the final data.table using the idx vectors.

mRef <- updatecross(idx1 - 1L, idx2 - 1L, mRef)
cross <- cbind(data.frame(V1, V2),
               setNames(cbind(as.data.frame(mRef[idx1,]),
                              as.data.frame(mRef[idx2,])),
                        c("R1", "G1", "B1", "R2", "G2", "B2")))

cross
#>   V1 V2       R1      G1       B1       R2      G2       B2
#> 1  A  B 104.0625 66.5625 154.6875  90.0000 78.7500 144.3750
#> 2  B  E  90.0000 78.7500 144.3750 104.0625 66.5625 154.6875
#> 3  C  A 127.5000 52.5000 146.2500 104.0625 66.5625 154.6875
#> 4  D  G  90.0000 78.7500 144.3750  99.3750 76.8750 125.6250
#> 5  B  C  90.0000 78.7500 144.3750 127.5000 52.5000 146.2500
#> 6  A  G 104.0625 66.5625 154.6875  99.3750 76.8750 125.6250
#> 7  A  E 104.0625 66.5625 154.6875 104.0625 66.5625 154.6875
#> 8  D  B  90.0000 78.7500 144.3750  90.0000 78.7500 144.3750

D and B are everywhere the same because they were the last to be updated. Similarly, A and E are everywhere the same because neither was updated after row 7.

CodePudding user response：

Using igraph (not sure if this will be any faster):

library(igraph)
library(purrr)

vertices <- tibble(
  V = c("A", "B", "C", "D", "G", "E"),
  R = c(120, 195, 135, 30, 75, 60),
  G = c(0, 195, 0, 195, 15, 15),
  B = c(240, 195, 0, 195, 150, 150)
)

edges <- tibble(
  from = c("A", "B", "C", "D", "B",  "A", "A", "D"),
  to = c("B", "E", "A", "G", "C", "G", "E", "B")
)

g <- graph_from_data_frame(edges, vertices = vertices, directed = FALSE)

for(iRow in seq_len(nrow(edges))){
  
  v <- as.character(edges[iRow,])
  
  values <- igraph::vertex.attributes(g, v) %>%
    .[-1] %>%
    map_dbl(mean) 
  
  for(iAttr in names(values))
    vertex_attr(g, iAttr, v) <- values[[iAttr]]
}

as_tibble(vertex.attributes(g))

result:

  name      R     G     B
  <chr> <dbl> <dbl> <dbl>
1 A     104.   66.6  162.
2 B      90    78.8  159.
3 C     128.   52.5  146.
4 D      90    78.8  159.
5 G      99.4  76.9  141.
6 E     104.   66.6  162.

I'm not sure what this process represents, final result is dependent of the sequence of merging described by your data. Can you maybe share your original problem from which this one is derived?