I've been racking my head over this over the past few days. I have a dataset that appears as follows:
V1 <- c("A", "B", "C", "D", "B", "A", "A", "D")
V2 <- c("B", "E", "A", "G", "C", "G", "E", "B")
R1 <- c(120, 195, 135, 30, 195, 120, 120, 30)
G1 <- c(0, 195, 0, 195, 195, 0, 0, 195)
B1 <- c(240, 195, 0, 135, 195, 240, 240, 135)
R2 <- c(195, 60, 120, 75, 135, 75, 60, 195)
G2 <- c(195, 15, 0, 15, 0, 15, 15, 195)
B2 <- c(195, 150, 240, 150, 0, 150, 150, 195)
cross <- data.frame(V1,V2,R1,G1,B1,R2,G2,B2)
This is a grossly simplified version of my actual dataset, which is a lot larger with over 60,000 observations. In the first two columns, V1
and V2
denote ID variables. Then R1
, B1
and G1
correspond to the attributes associated with V1
and, likewise, R2
, B2
and G2
are the attributes of V2
(this is a network model, mapping which nodes are linked to which nodes and the associated attributes of those nodes, but that's beside the point here). This gives a data.frame that looks like:
V1 V2 R1 G1 B1 R2 G2 B2
1 A B 120 0 240 195 195 195
2 B E 195 195 195 60 15 150
3 C A 135 0 0 120 0 240
4 D G 30 195 135 75 15 150
5 B C 195 195 195 135 0 0
6 A G 120 0 240 75 15 150
7 A E 120 0 240 60 15 150
8 D B 30 195 135 195 195 195
Note that IDs that appear in V1
can also appear in V2
. Also note that the associated attributes for each ID is unique and uniform across the dataset.
Now, what I want to do is iterate through each row and calculate a series of means. After the first iteration of this process, my data should look like:
V1 V2 R1 G1 B1 R2 G2 B2
1 A B 157.5 97.5 217.5 157.5 97.5 217.5
2 B E 195 195 195 60 15 150
3 C A 135 0 0 120 0 240
4 D G 30 195 135 75 15 150
5 B C 195 195 195 135 0 0
6 A G 120 0 240 75 15 150
7 A E 120 0 240 60 15 150
8 D B 30 195 135 195 195 195
i.e., R1
and R2
is the mean of its prior two respective values for the first row, 120
and 195
. And so on.
Then, that mean is reinserted as the new attributes (matched to ID), and the process starts again with the next row. After this step, my dataset should look like:
V1 V2 R1 G1 B1 R2 G2 B2
1 A B 157.5 97.5 217.5 157.5 97.5 217.5
2 B E 157.5 97.5 217.5 60 15 150
3 C A 135 0 0 157.5 97.5 217.5
4 D G 30 195 135 75 15 150
5 B C 157.5 97.5 217.5 135 0 0
6 A G 157.5 97.5 217.5 75 15 150
7 A E 157.5 97.5 217.5 60 15 150
8 D B 30 195 135 157.5 97.5 217.5
Thus, all attributes for A
and B
have changed. The process would then continue for the second row (for B
and E
) and so on. The process continues until the last row is reached.
Here is my code so far. I have not used data.table
as that's what I'm trying to figure out. But here it is. It works, but it's extremely slow, making it difficult for me to explore what's going on in the system.
for(i in 1:nrow(cross)){
Rc <- (cross[i,3] cross[i,6]) / 2
Gc <- (cross[i,4] cross[i,7]) / 2
Bc <- (cross[i,5] cross[i,8]) / 2
V1c <- cross[i,"V1"]
V2c <- cross[i,"V2"]
cross$R1 <- with(cross, replace(R1, V1 == V1c, Rc))
cross$G1 <- with(cross, replace(G1, V1 == V1c, Gc))
cross$B1 <- with(cross, replace(B1, V1 == V1c, Bc))
cross$R1 <- with(cross, replace(R1, V1 == V2c, Rc))
cross$G1 <- with(cross, replace(G1, V1 == V2c, Gc))
cross$B1 <- with(cross, replace(B1, V1 == V2c, Bc))
cross$R2 <- with(cross, replace(R2, V2 == V2c, Rc))
cross$G2 <- with(cross, replace(G2, V2 == V2c, Gc))
cross$B2 <- with(cross, replace(B2, V2 == V2c, Bc))
cross$R2 <- with(cross, replace(R2, V2 == V1c, Rc))
cross$G2 <- with(cross, replace(G2, V2 == V1c, Gc))
cross$B2 <- with(cross, replace(B2, V2 == V1c, Bc))
}
This process takes longer than one hour given the size of my data. From what I gather, data.table
is supposed to be much faster. I've tried almost everything, ranging from tidyverse stuff to converting into matrices. I've even subsetted the data that needs to be replaced. But I'm having quite a lot of difficulty trying to figure out using data.table
for this, which apparently would skip the need to use a for
loop.
If it's any help, the slow part of the loop seems to be the section where variables get replaced, not when they are generated.
Thanks in advance!
CodePudding user response:
I'm a big fan of data.table
, but I don't think it's needed here. Instead of updating all matching entries in the whole data.frame
on each iteration, just update a reference matrix by index.
V1 <- c("A", "B", "C", "D", "B", "A", "A", "D")
V2 <- c("B", "E", "A", "G", "C", "G", "E", "B")
R1 <- c(120, 195, 135, 30, 195, 120, 120, 30)
G1 <- c(0, 195, 0, 195, 195, 0, 0, 195)
B1 <- c(240, 195, 0, 135, 195, 240, 240, 135)
R2 <- c(195, 60, 120, 75, 135, 75, 60, 195)
G2 <- c(195, 15, 0, 15, 0, 15, 15, 195)
B2 <- c(195, 150, 240, 150, 0, 150, 150, 195)
V12 <- c(V1, V2)
uids <- unique(V12)
idx1 <- match(V1, uids)
idx2 <- match(V2, uids)
mRef <- matrix(c(R1, R2, G1, G2, B1, B2), ncol = 3)[match(uids, V12),]
mRef
is a matrix of R
, G
, B
values for the unique IDs in V1
and V2
. The idx
vectors point to the rows of mRef
corresponding to the IDs in V1
and V2
.
Here's a quick little Rcpp
function to iterate through idx1
and idx2
to update mRef.
Rcpp::cppFunction('NumericMatrix updatecross(const IntegerVector& id1, const IntegerVector& id2, NumericMatrix attr) {
const int idrows = id1.length();
const int attrcols = attr.ncol();
double newval = 0;
for (int col = 0; col < attrcols; col ) {
for (int row = 0; row < idrows; row ) {
newval = (attr(id1(row), col) attr(id2(row), col))/2;
attr(id1(row), col) = newval;
attr(id2(row), col) = newval;
}
}
return attr;
}')
Update mRef
and build the final data.table
using the idx
vectors.
mRef <- updatecross(idx1 - 1L, idx2 - 1L, mRef)
cross <- cbind(data.frame(V1, V2),
setNames(cbind(as.data.frame(mRef[idx1,]),
as.data.frame(mRef[idx2,])),
c("R1", "G1", "B1", "R2", "G2", "B2")))
cross
#> V1 V2 R1 G1 B1 R2 G2 B2
#> 1 A B 104.0625 66.5625 154.6875 90.0000 78.7500 144.3750
#> 2 B E 90.0000 78.7500 144.3750 104.0625 66.5625 154.6875
#> 3 C A 127.5000 52.5000 146.2500 104.0625 66.5625 154.6875
#> 4 D G 90.0000 78.7500 144.3750 99.3750 76.8750 125.6250
#> 5 B C 90.0000 78.7500 144.3750 127.5000 52.5000 146.2500
#> 6 A G 104.0625 66.5625 154.6875 99.3750 76.8750 125.6250
#> 7 A E 104.0625 66.5625 154.6875 104.0625 66.5625 154.6875
#> 8 D B 90.0000 78.7500 144.3750 90.0000 78.7500 144.3750
D and B are everywhere the same because they were the last to be updated. Similarly, A and E are everywhere the same because neither was updated after row 7.
CodePudding user response:
Using igraph
(not sure if this will be any faster):
library(igraph)
library(purrr)
vertices <- tibble(
V = c("A", "B", "C", "D", "G", "E"),
R = c(120, 195, 135, 30, 75, 60),
G = c(0, 195, 0, 195, 15, 15),
B = c(240, 195, 0, 195, 150, 150)
)
edges <- tibble(
from = c("A", "B", "C", "D", "B", "A", "A", "D"),
to = c("B", "E", "A", "G", "C", "G", "E", "B")
)
g <- graph_from_data_frame(edges, vertices = vertices, directed = FALSE)
for(iRow in seq_len(nrow(edges))){
v <- as.character(edges[iRow,])
values <- igraph::vertex.attributes(g, v) %>%
.[-1] %>%
map_dbl(mean)
for(iAttr in names(values))
vertex_attr(g, iAttr, v) <- values[[iAttr]]
}
as_tibble(vertex.attributes(g))
result:
name R G B
<chr> <dbl> <dbl> <dbl>
1 A 104. 66.6 162.
2 B 90 78.8 159.
3 C 128. 52.5 146.
4 D 90 78.8 159.
5 G 99.4 76.9 141.
6 E 104. 66.6 162.
I'm not sure what this process represents, final result is dependent of the sequence of merging described by your data. Can you maybe share your original problem from which this one is derived?