Home > OS >  Difference Matrix between all elements of a column
Difference Matrix between all elements of a column

Time:08-24

I have a data frame df

ID  scores
1   2.1
2   1.3
3   -1
4   -3
5   2.4

I am interested in calculating a difference matrix that contains the difference of each element from the column scores with every element of the same column (including itself).

My desired output is something like this:

    1     2    3    4    5
1   0     0.8  3.1  5.1 -0.3
2  -0.8   0    2.3  4.3 -1.1
3  -3.1  -2.3  0    2   -3.4
4  -5.1  -4.3 -2    0   -5.4
5   0.3   1.1  3.4  5.4  0

The following post is relevant but asks to compute the differences in another way Find the differences in all possible ways of list elements

Is there an easy way to achieve this output, perhaps by using dplyr or some built-in function? Any help or guidance is greatly appreciated!

CodePudding user response:

You could use outer, which is a base R function for passing all pairwise combinations of the elements of two vectors to a binary function such as -:

df <- data.frame(ID = 1:5, scores = c(2.1, 1.3, -1, -3, 2.4))

outer(df$scores, df$scores, `-`)
     [,1] [,2] [,3] [,4] [,5]
[1,]  0.0  0.8  3.1  5.1 -0.3
[2,] -0.8  0.0  2.3  4.3 -1.1
[3,] -3.1 -2.3  0.0  2.0 -3.4
[4,] -5.1 -4.3 -2.0  0.0 -5.4
[5,]  0.3  1.1  3.4  5.4  0.0

CodePudding user response:

I like outer (see @allan-cameron's answer) for it's conciseness and efficiency and use it quite often.

To get the difference matrix even faster, we could remove some overhead from outer,

diffmat <- \(x) {l <- length(x); `-`(x, rep.int(x, rep.int(l, l))) |> array(dim=rep.int(l, 2))}

or apply the logic using Rcpp.

library(Rcpp)
cppFunction('
NumericMatrix diffmat_rcpp(NumericVector v) {
  int n = v.size();
  NumericVector z(pow(n, 2));
  for (int i = 0; i < n; i  ) {
    int j = 0;
    while (j < n) {
      z[i*n   j] = v[j] - v[i];
      j  ;
    }
  }
  z.attr("dim") = Dimension(n, n);
  return wrap(z);
  }
')


v <- df$scores


diffmat(v)
diffmat_rcpp(v)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]  0.0  0.8  3.1  5.1 -0.3
# [2,] -0.8  0.0  2.3  4.3 -1.1
# [3,] -3.1 -2.3  0.0  2.0 -3.4
# [4,] -5.1 -4.3 -2.0  0.0 -5.4
# [5,]  0.3  1.1  3.4  5.4  0.0


stopifnot(all(sapply(list(diffmat(v), diffmat_rcpp(v)), identical, outer(v, v, `-`))))

Benchmark

v <- rnorm(5e3)
microbenchmark::microbenchmark(outer(v, v, `-`), diffmat(v), diffmat_rcpp(v), times=10L)
# Unit: milliseconds
#             expr      min       lq     mean   median       uq      max neval cld
# outer(v, v, `-`) 419.0008 433.3438 452.3562 458.2913 465.6319 471.5301    10   c
#       diffmat(v) 278.3501 313.1150 320.2725 317.4468 340.4437 359.9796    10  b 
#  diffmat_rcpp(v) 125.8512 131.2582 149.1610 140.4436 161.3835 203.3211    10 a  

Data:

df <- structure(list(ID = 1:5, scores = c(2.1, 1.3, -1, -3, 2.4)), class = "data.frame", row.names = c(NA, 
-5L))
  • Related