In my work I often need to aggregate and expand matrices of various quantities, and I am looking for the most efficient ways to do these operations. E.g. I'll have an NxN matrix that I want to aggregate down to PxP, where P < N. This is done using a correspondence between the larger dimensions and the smaller dimensions. Usually, P will be around 100 or so.
For example, I'll have a hypothetical 4x4 matrix like this (though in practice, my matrices will be much larger, around 1000x1000):
>>> m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
>>> m
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])
and a correspondence like this:
0,0
1,1
2,0
3,1
that I usually store in a dictionary. This means that indices 0 and 2 both get allocated to new index 0 and indices 1 and 3 both get allocated to new index 1. The matrix could be anything at all, but the correspondence is always many-to-one when I want to compress.
The aggregation process here would lead to:
array([[24, 28],
       [40, 44]])
I can do this by making an empty matrix of the right size and accumulating in nested loops over all 4x4=16 cells of the initial matrix, but this seems inefficient, and people always emphasise the vectorised nature of numpy. I have also done it by using np.ix_ to build sets of indices and summing each block with m[row_indices, col_indices].sum(), but I am wondering what the most efficient numpy-like way to do it is.
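For concreteness, here is a minimal sketch of that np.ix_ approach (the groups dict, mapping each new index to the original indices allocated to it, is just an illustrative name):
import numpy as np

# sketch of the np.ix_ block-summing approach; "groups" maps each new index
# to the list of original indices allocated to it (illustrative name)
m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
groups = {0: [0, 2], 1: [1, 3]}
mnew = np.zeros((2, 2), dtype=m.dtype)
for p, rows in groups.items():
    for q, cols in groups.items():
        # sum the block of m formed by the original rows/cols of this pair
        mnew[p, q] = m[np.ix_(rows, cols)].sum()
# mnew -> array([[24, 28],
#                [40, 44]])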
Conversely, what is the sensible and efficient way to expand a matrix using the correspondence the other way? For example, with the same correspondence but in reverse, I would go from:
array([[1, 2],
       [3, 4]])
to
array([[1, 2, 1, 2],
       [3, 4, 3, 4],
       [1, 2, 1, 2],
       [3, 4, 3, 4]])
where the values simply get replicated into the new cells.
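To make the replication rule concrete, here is a minimal numpy sketch (idx, an array form of the reversed correspondence, is an illustrative name):
import numpy as np

# sketch of the expansion via fancy indexing; idx[i] is the small index
# that big index i corresponds to (illustrative name)
m2 = np.array([[1, 2], [3, 4]])
idx = np.array([0, 1, 0, 1])
mbig = m2[np.ix_(idx, idx)]  # replicates rows and columns per the correspondence
# mbig -> the 4x4 array shown above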
In my attempts so far for the aggregation, I have used pandas methods, with groupby on the index and on the columns and then extracting the final matrix with, e.g., df.values. However, I don't know the equivalent way to expand a matrix without using a lot of things like unstack and join and so on. And I often see people say that pandas is not time-efficient.
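For reference, a minimal sketch of that pandas aggregation, assuming the correspondence is stored in the big2small dict used below (the second groupby works on the transpose, since grouping along axis=1 is deprecated in recent pandas):
import numpy as np
import pandas as pd

# sketch of the pandas groupby aggregation; big2small maps original
# indices to new indices
m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
big2small = {0: 0, 1: 1, 2: 0, 3: 1}
df = pd.DataFrame(m)
# group the rows, then transpose and group what were the columns
agg = df.groupby(big2small).sum().T.groupby(big2small).sum().T
mnew = agg.values  # array([[24, 28],
                   #        [40, 44]])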
Edit 1: I was asked in a comment about exactly how the aggregation should be done. This is how it would be done if I were using nested loops and a dictionary lookup between the original dimensions and the new dimensions:
>>> m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
>>> mnew = np.zeros((2, 2))
>>> big2small = {0: 0, 1: 1, 2: 0, 3: 1}
>>> for i in range(4):
...     inew = big2small[i]
...     for j in range(4):
...         jnew = big2small[j]
...         mnew[inew, jnew] += m[i, j]
...
>>> mnew
array([[24., 28.],
       [40., 44.]])
CodePudding user response:
Assuming your indices don't have a regular structure, I would try sparse matrices.
import scipy.sparse as ss
import numpy as np
# the correspondence as an array of (big, small) index pairs
g = np.array([[0, 0], [1, 1], [2, 0], [3, 1]])
# a sparse indicator matrix built from (data=ones, (row_ind=g[:, 0], col_ind=g[:, 1]));
# it is one for every pair (g[i, 0], g[i, 1]), zero elsewhere
u = ss.csr_matrix((np.ones(len(g)), (g[:, 0], g[:, 1])))
Aggregate
m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
u.T @ m @ u
Expand
m2 = np.array([[1,2],[3,4]])
u @ m2 @ u.T
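A quick check against the example above (the outputs are floats because u is built from ones):
# sanity check against the question's example
print(u.T @ m @ u)   # aggregated: [[24. 28.]
                     #              [40. 44.]]
print(u @ m2 @ u.T)  # expanded:   [[1. 2. 1. 2.]
                     #              [3. 4. 3. 4.]
                     #              [1. 2. 1. 2.]
                     #              [3. 4. 3. 4.]]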