How can I calculate the column ranks of a large numpy matrix?

Suppose I have the following representation of my data:

import numpy as np
import itertools

N = 100 # number of simulations
vars = 5 # number of variables to simulate 
indices = list(range(vars))

sims = np.random.rand(vars,N) # simulation array of size (5,100)
combinations = list(itertools.combinations(indices,3)) # list of combinations of variables of length 3

# create a matrix which sums the variable simulations for each combination
combo_sims = np.empty((len(combinations),sims.shape[1])) 
for i in range(len(combinations)):
    combo_sims[i] = np.sum(sims[list(combinations[i])],axis=0)

  1. How can I get the rank of each value within its column? The expected result would be a matrix of the same shape as combo_sims, but with ranks instead of the summed sims. For example,
[[1,0,2],[0,2,1],[2,1,0]]

should return:

[[2,1,3],[1,3,2],[3,2,1]]
  2. My actual data involves ~1,500,000 combinations and 10,000 simulations. What is the fastest way to get the result from question 1, keeping memory utilization in mind?
  3. Is there a more efficient way to build the combo_sims matrix?

CodePudding user response:

  1. You can use scipy.stats.rankdata or np.argsort.

If you do not need special handling of ties, use np.argsort (a double argsort gives the ranks), because it is faster than scipy.stats.rankdata; a short sketch of both follows this list.

  2. 1.5M x 10K float64 values = 120 GB of memory. If you do not have enough RAM, you can process the data in chunks and save the results to disk; a chunked sketch follows the optimized example at the end of this answer.

  3. If your combinations have a fixed length of 3, then you can a) precompute some of the calculations, and b) if possible, use float32 instead of float64, because it is faster and takes half the memory.
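
For question 1, a minimal sketch of both options on the small example from the question (axis=0 ranks the values within each column, and 1 is added to make the ranks 1-based, matching the expected output):

import numpy as np
from scipy.stats import rankdata

a = np.array([[1, 0, 2],
              [0, 2, 1],
              [2, 1, 0]])

# Double argsort: the rank of each value within its column (0-based), plus 1.
# Ties are broken by position.
ranks_argsort = a.argsort(axis=0).argsort(axis=0) + 1

# rankdata handles ties explicitly via method='min', 'average', 'dense', ...
# (the axis argument needs SciPy >= 1.4).
ranks_scipy = rankdata(a, method='min', axis=0)

print(ranks_argsort)
# [[2 1 3]
#  [1 3 2]
#  [3 2 1]]

With distinct values both give the same result; they differ only in how ties are assigned.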

I've modified your code and it became about 5x faster:

Number of combinations: 447580
Number of simulations: 10000
original: 10.793719053268433 seconds
with optimizations: 2.0908007621765137 seconds

Example:

import itertools
import time

import numpy as np

N = 10_000  # number of simulations
n_vars = 140  # number of variables to simulate
indices = list(range(n_vars))

sims = np.random.rand(n_vars, N)  # simulation array of size (n_vars, N)


combinations = list(
    itertools.combinations(indices, 3)
)  # list of combinations of variables of length 3
print(f"Number of combinations: {len(combinations)}")
print(f"Number of simulations: {N}")

################################# Original version

start_time = time.time()
# create a matrix which sums the variable simulations for each combination
combo_sims = np.empty((len(combinations), sims.shape[1]))
for i in range(len(combinations)):
    combo_sims[i] = np.sum(sims[list(combinations[i])], axis=0)
print(f"original: {time.time() - start_time} seconds")

################################# Modified version

start_time = time.time()

sims = sims.astype(np.float32)
combinations = np.array(combinations, int)
combo_sims2 = np.empty((len(combinations), sims.shape[1]), np.float32)
pairs = np.array(list(itertools.combinations(range(n_vars), 2)), int)
n = 0
buf = np.zeros(sims.shape[1], np.float32)
for i, j in pairs:
    np.add(sims[i], sims[j], buf)
    for k in range(j + 1, n_vars):
        np.add(buf, sims[k], combo_sims2[n])
        n += 1

print(f"with optimizations: {time.time() - start_time} seconds")

assert np.allclose(combo_sims, combo_sims2)
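
For the memory concern in question 2, a minimal sketch of chunked column-wise ranking (the sizes, the chunk width and the ranks.npy filename are placeholders; ranking down a column needs every row of that column, so the chunks are taken over the simulation columns, and combo_sims is assumed to be available as float32, e.g. via a memmap or built chunk by chunk):

import numpy as np

n_rows, n_cols = 100_000, 1_000   # stand-ins for ~1.5M combinations x 10K simulations
chunk = 100                       # simulation columns ranked per pass

combo_sims = np.random.rand(n_rows, n_cols).astype(np.float32)

# Memory-mapped output on disk, so the full rank matrix never has to live in RAM.
ranks = np.lib.format.open_memmap(
    "ranks.npy", mode="w+", dtype=np.int32, shape=(n_rows, n_cols)
)

for start in range(0, n_cols, chunk):
    stop = min(start + chunk, n_cols)
    block = combo_sims[:, start:stop]
    # 1-based column-wise ranks for this block of simulations only.
    ranks[:, start:stop] = block.argsort(axis=0).argsort(axis=0) + 1

ranks.flush()

Peak memory then scales with the chunk width rather than with all 10,000 simulations.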