I have a data frame df
ID scores
1 2.1
2 1.3
3 -1
4 -3
5 2.4
I am interested in calculating a difference matrix that contains the difference of each element from the column scores
with every element of the same column (including itself).
My desired output is something like this:
1 2 3 4 5
1 0 0.8 3.1 5.1 -0.3
2 -0.8 0 2.3 4.3 -1.1
3 -3.1 -2.3 0 2 -3.4
4 -5.1 -4.3 -2 0 -5.4
5 0.3 1.1 3.4 5.4 0
The following post is relevant but asks to compute the differences in another way Find the differences in all possible ways of list elements
Is there an easy way to achieve this output, perhaps by using dplyr
or some built-in function? Any help or guidance is greatly appreciated!
CodePudding user response:
You could use outer
, which is a base R function for passing all pairwise combinations of the elements of two vectors to a binary function such as -
:
df <- data.frame(ID = 1:5, scores = c(2.1, 1.3, -1, -3, 2.4))
outer(df$scores, df$scores, `-`)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0 0.8 3.1 5.1 -0.3
[2,] -0.8 0.0 2.3 4.3 -1.1
[3,] -3.1 -2.3 0.0 2.0 -3.4
[4,] -5.1 -4.3 -2.0 0.0 -5.4
[5,] 0.3 1.1 3.4 5.4 0.0
CodePudding user response:
I like outer
(see @allan-cameron's answer) for it's conciseness and efficiency and use it quite often.
To get the difference matrix even faster, we could remove some overhead from outer
,
diffmat <- \(x) {l <- length(x); `-`(x, rep.int(x, rep.int(l, l))) |> array(dim=rep.int(l, 2))}
or apply the logic using Rcpp
.
library(Rcpp)
cppFunction('
NumericMatrix diffmat_rcpp(NumericVector v) {
int n = v.size();
NumericVector z(pow(n, 2));
for (int i = 0; i < n; i ) {
int j = 0;
while (j < n) {
z[i*n j] = v[j] - v[i];
j ;
}
}
z.attr("dim") = Dimension(n, n);
return wrap(z);
}
')
v <- df$scores
diffmat(v)
diffmat_rcpp(v)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.0 0.8 3.1 5.1 -0.3
# [2,] -0.8 0.0 2.3 4.3 -1.1
# [3,] -3.1 -2.3 0.0 2.0 -3.4
# [4,] -5.1 -4.3 -2.0 0.0 -5.4
# [5,] 0.3 1.1 3.4 5.4 0.0
stopifnot(all(sapply(list(diffmat(v), diffmat_rcpp(v)), identical, outer(v, v, `-`))))
Benchmark
v <- rnorm(5e3)
microbenchmark::microbenchmark(outer(v, v, `-`), diffmat(v), diffmat_rcpp(v), times=10L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# outer(v, v, `-`) 419.0008 433.3438 452.3562 458.2913 465.6319 471.5301 10 c
# diffmat(v) 278.3501 313.1150 320.2725 317.4468 340.4437 359.9796 10 b
# diffmat_rcpp(v) 125.8512 131.2582 149.1610 140.4436 161.3835 203.3211 10 a
Data:
df <- structure(list(ID = 1:5, scores = c(2.1, 1.3, -1, -3, 2.4)), class = "data.frame", row.names = c(NA,
-5L))