I am working with a 3072 x 3072 sparse matrix which supposedly has some structured blocks,some blocks of zero and some unstructured blocks.Right now I have extracted all the non zero indices of the matrix and arranged column indices for every row having non zero values .I am trying to find clusters in these non zero elements by using excel and plotting 25 rows each for columns indices that appear to be closer.Since the matrix is a large one ,this is quite tedious and time consuming.
Could anyone please suggest an efficient way to find and visualize such structure in the matrix?
CodePudding user response:
You can use the spy
function to plot a sparse matrix. Documentation here.
I would point out that 3072 x 3072 is not particularly large. You could convert it to a dense matrix and plot it using imagesc
.
CodePudding user response:
3072x3072 is not particularly large, all things considered. Even using 64 bit numbers, it will only take 75.5 MB or so of memory. You should not have any issues with representing it as a dense matrix on modern hardware. My Galaxy S8 phone is able to invert a matrix of that size in 4.6 seconds using Python and numpy, so you should have a wealth of options available from a computational perspective.
What kind of data does this represent? Is it positive and does it make sense to interpret it as a graph or network? How noisy is it? Specifically, do rows and columns in a given block "connect" to other blocks ever, or are they completely disjoint? If it's the latter, you can use any algorithm for search in a graph you like to find the connected components in the graph (grouping rows and columns into sets that cannot be reached by the other sets). If the former, it's a bit more complicated.
You can always visualize the matrix as a heatmap- darker grid squares for larger elements, white for zeroes. But this won't be helpful in visualization unless the rows and columns are ordered in such a way that the blocks are visually distinct. Ordering so that connected components are contiguous is a good option if it's clear, but if not you'll need to look at more sophisticated methods. Reverse Cuthill-Mcgee is an algorithm for ordering the rows and columns in a way that minimizes the bandwidth of the graph- trying to keep nonzero entries as close to the diagonal as possible. This is easily implemented with a breadth first search if you don't have a prebaked option. If that doesn't do the trick, I would recommend looking into graph partitioning methods, which can address cases where blocks are not completely disjoint. Spectral clustering is easily implemented with an eigendecomposition followed by K means, and it has the benefit of being fast to compute the decomposition (same complexity as inversion) and determining the correct number of clusters will be feasible with an exhaustive search, since you won't need to run the decomposition each time, only k means. Then you can reorder the rows and columns to put those in the same cluster together and visualize with a heatmap.
Alternatively, you could use any of the good network visualization tools to directly draw a graph with nodes placed sensibly, but that may not be very interpretable with this many nodes depending on the level of detail you need.