Home > Software design >  Numpy array vs list of lists - editing values one by one (help implementing)
Numpy array vs list of lists - editing values one by one (help implementing)

Time:12-03

I'm working with DNA sequence alignments and trying to implement a simple scoring algorithm. Since i have to use a matrix for the calculations, i thought numpy should be way faster than a list of lists, but as I tested both, the python lists seem to be way faster. I found this thread (Why use numpy over list based on speed?) but still; i'm using preallocated numpy vs preallocated lists and list of lists are the clear winners. Here is my code:

Lists

def edirDistance(x, y):
    x_dim = len(x) 1
    y_dim = len(y) 1
    D = []
    for i in range(x_dim):
        D.append([0] * (y_dim))
    
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1]   1
            distVer = D[i-1][j]   1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1]   1
            D[i][j] = min(distHor, distVer,distDiag)
    return D

Numpy

def NP_edirDistance(x, y):
    x_dim = len(x) 1
    y_dim = len(y) 1
    D = np.zeros((x_dim,y_dim))
    
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1]   1
            distVer = D[i-1][j]   1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1]   1
            D[i][j] = min(distHor, distVer,distDiag)
    return D

I'm not timing the np import.

a = 'ACGTACGACTATCGACTAGCTACGAA'
b = 'ACCCACGTATAACGACTAGCTAGGGA'
%%time
edirDistance(a, b)

total: 1.41 ms

%%time
NP_edirDistance(a, b)

total: 4.43 ms

Replacing D[i][j] by D[i,j] greatly improved time, but still slower. (Thanks @Learning is a mess !)

total: 2.64 ms

I tested with even larger DNA sequences (around 10.000 letters each) and still lists are winning.

Can someone help me improve timing?

Are lists better for this use?

CodePudding user response:

One way to have faster run is to use GPU/TPU-aided accelerators such as numba and …. I have tested your codes by that a and b on google colab TPU without using accelerators:

1000 loops, best of 5: 563 µs per loop
1000 loops, best of 5: 1.95 ms per loop    # NumPy

But with using numba as nopython=True, without any changes to your codes:

import numba as nb

@nb.njit()
def edirDistance(x, y):
       .
       .

@nb.njit()
def NP_edirDistance(x, y):
       .
       .

It gets:

1000 loops, best of 5: 213 µs per loop
1000 loops, best of 5: 153 µs per loop    # NumPy

Which will get significant difference between them using huge samples or by improving and vectorizing your NumPy codes. This method results as below for samples with 10000 length:

35.50053691864014
22.95994758605957    # NumPy (seconds)
  • Related