from collections import Counter
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import pairwise_distances
import sys
import heapq
from .kmeanspp import kmeanspp
from .utils import log
from fbpca import pca
import math as math
import statistics
import random
from operator import itemgetter
import time
lil = m.tolil()
gene_avg = []
cell_rndm = random.sample(range(lil.shape[0]), round(lil.shape[0]*rndm_cell_pcnt))
print('calculating gene averages')
for gene in range(lil.shape[1]):
    print('calculating mean')
    gene_avg.append(lil[:,gene].todense().mean())
print('setting chosen cells to cell type random. changing gene expression')
for cell in cell_rndm:
    nonzero = set(lil[cell,:].nonzero()[1])
    # sample from a list: random.sample on a set is deprecated since Python 3.9
    rndm_nonzero = random.sample(list(nonzero), round(len(nonzero)*gene_prcnt))
    zero = list(set(range(lil.shape[1])) - nonzero)
    rndm_zero = random.sample(zero, round(len(zero)*gene_prcnt))
    print('setting celltype to random')
    labels[cell] = 'random'
    print('rearranging some gene expression')
    lil[cell,rndm_nonzero] = 0.0
    lil[cell,rndm_zero] = list(itemgetter(*rndm_zero)(gene_avg))
In the first for loop I compute the mean expression of each of the 30k genes. In the second I go through the randomly sampled cells (out of 190k), setting some of their nonzero genes to zero and some of their zero genes to the gene average. This process takes a very long time.
CodePudding user response:
for gene in range(lil.shape[1]):
    print('calculating mean')
    gene_avg.append(lil[:,gene].todense().mean())
should be replaceable with
arr = lil.A                  # dense array
gene_avg = arr.mean(axis=0)  # column (per-gene) means
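To double-check the axis: with cells as rows and genes as columns (as in the question), the per-gene mean is a column mean, i.e. axis=0. A minimal sketch on a toy matrix (the values here are made up for illustration):

```python
import numpy as np
from scipy.sparse import lil_matrix

# Toy matrix: 2 cells (rows) x 3 genes (columns)
m = lil_matrix(np.array([[0., 2., 0.],
                         [4., 0., 6.]]))

# Loop version from the question: one column mean per gene
loop_avg = [m[:, g].todense().mean() for g in range(m.shape[1])]

# Vectorized: densify once, then take column means in a single call
arr = m.toarray()
gene_avg = arr.mean(axis=0)   # axis=0 -> average over cells, per gene

print(loop_avg)   # [2.0, 1.0, 3.0]
print(gene_avg)   # [2. 1. 3.]
```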
With numpy random, you should be able to take all 'random-samples' at once. I haven't studied your code enough to give you the details.
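A sketch of what "all at once" could look like: numpy's Generator.choice with replace=False draws a no-repeat sample in one call, which could stand in for the random.sample call that picks the cell indices (the sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, cell_pcnt = 1000, 0.1   # hypothetical sizes

# One vectorized call instead of random.sample(range(n_cells), ...)
cell_rndm = rng.choice(n_cells, size=round(n_cells * cell_pcnt), replace=False)
print(cell_rndm.shape)   # (100,)
```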
While lil is the best sparse format for item assignment, assigning values to dense arrays is faster still - if they fit in memory.
Trying to understand the action:
For one cell (from a random sample, no repeats), lil[cell,:] is a row of the lil matrix.
Get the nonzero indices as a set; set(lil.rows[cell]) may do the same thing:
nonzero=set(lil[cell,:].nonzero()[1])
then get a random sample of those:
rndm_nonzero=random.sample(nonzero,round(len(nonzero)*gene_prcnt))
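A quick check that lil.rows really does carry the same information as the nonzero() call, on a toy matrix:

```python
import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix(np.array([[0., 2., 0., 5.],
                         [4., 0., 6., 0.]]))

cell = 0
via_nonzero = set(m[cell, :].nonzero()[1])   # slice, then find nonzeros
via_rows = set(m.rows[cell])                 # lil stores them directly
print(via_nonzero == via_rows)   # True
```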
And a random sample of the zeros, using set difference. The set operations feel slower than needed, but I haven't worked out an alternative.
zero =list(set(list(range(lil.shape[1])))-nonzero)
rndm_zero=random.sample(zero,round(len(zero)*gene_prcnt))
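One possible alternative to the set difference is np.setdiff1d, which works directly on index arrays; a sketch on the same kind of toy matrix:

```python
import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix(np.array([[0., 2., 0., 5.],
                         [4., 0., 6., 0.]]))

cell = 0
nonzero = np.asarray(m.rows[cell])                  # nonzero column indices
zero = np.setdiff1d(np.arange(m.shape[1]), nonzero) # the remaining columns
print(zero)   # [0 2]
```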
Prints are great for debugging, but they do slow down the run.
print('setting celltype to random')
labels[cell] = 'random'
print('rearranging some gene expression')
And finally set some elements of that row to 0
lil[cell,rndm_nonzero]=0.0
If gene_avg is an array, then you can use gene_avg[rndm_zero] instead of this itemgetter. I don't think itemgetter is any faster than [gene_avg[i] for i in rndm_zero].
lil[cell,rndm_zero] = list(itemgetter(*rndm_zero)(gene_avg))
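A tiny illustration of that fancy-indexing replacement (toy values):

```python
import numpy as np

gene_avg = np.array([2.0, 1.0, 3.0, 0.5])   # per-gene means as an array
rndm_zero = [0, 2]                          # indices to fill in

# Fancy indexing replaces list(itemgetter(*rndm_zero)(gene_avg))
vals = gene_avg[rndm_zero]
print(vals)   # [2. 3.]
```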
While it would be nice to work with all sampled rows at once:
arr[cell_rndm]
it would take a lot of work to get the details right, so for a start I'd focus on streamlining the row-by-row operation.
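Putting those pieces together, a streamlined row-by-row version on the dense array might look like this sketch - the toy matrix, labels, and gene_prcnt here stand in for the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real data
arr = np.array([[0., 2., 0., 5.],
                [4., 0., 6., 0.],
                [1., 0., 0., 2.]])
labels = ['a', 'b', 'c']
gene_prcnt = 0.5

gene_avg = arr.mean(axis=0)                          # per-gene means, once
cell_rndm = rng.choice(arr.shape[0], size=2, replace=False)

for cell in cell_rndm:
    nonzero = np.flatnonzero(arr[cell])              # expressed genes
    zero = np.flatnonzero(arr[cell] == 0)            # unexpressed genes
    rndm_nonzero = rng.choice(nonzero, size=round(len(nonzero) * gene_prcnt),
                              replace=False)
    rndm_zero = rng.choice(zero, size=round(len(zero) * gene_prcnt),
                           replace=False)
    labels[cell] = 'random'
    arr[cell, rndm_nonzero] = 0.0                    # silence some genes
    arr[cell, rndm_zero] = gene_avg[rndm_zero]       # fill some with the mean
```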