I'm working with DNA sequence alignments and trying to implement a simple scoring algorithm. Since i have to use a matrix for the calculations, i thought numpy should be way faster than a list of lists, but as I tested both, the python lists seem to be way faster. I found this thread (Why use numpy over list based on speed?) but still; i'm using preallocated numpy vs preallocated lists and list of lists are the clear winners. Here is my code:
Lists
def edirDistance(x, y):
x_dim = len(x) 1
y_dim = len(y) 1
D = []
for i in range(x_dim):
D.append([0] * (y_dim))
#Filling the matrix borders
for i in range(x_dim):
D[i][0] = i
for i in range(y_dim):
D[0][i] = i
for i in range(1, x_dim):
for j in range(1, y_dim):
distHor = D[i][j-1] 1
distVer = D[i-1][j] 1
if x[i-1] == y[j-1]:
distDiag = D[i-1][j-1]
else:
distDiag = D[i-1][j-1] 1
D[i][j] = min(distHor, distVer,distDiag)
return D
Numpy
def NP_edirDistance(x, y):
x_dim = len(x) 1
y_dim = len(y) 1
D = np.zeros((x_dim,y_dim))
#Filling the matrix borders
for i in range(x_dim):
D[i][0] = i
for i in range(y_dim):
D[0][i] = i
for i in range(1, x_dim):
for j in range(1, y_dim):
distHor = D[i][j-1] 1
distVer = D[i-1][j] 1
if x[i-1] == y[j-1]:
distDiag = D[i-1][j-1]
else:
distDiag = D[i-1][j-1] 1
D[i][j] = min(distHor, distVer,distDiag)
return D
I'm not timing the np import.
a = 'ACGTACGACTATCGACTAGCTACGAA'
b = 'ACCCACGTATAACGACTAGCTAGGGA'
%%time
edirDistance(a, b)
total: 1.41 ms
%%time
NP_edirDistance(a, b)
total: 4.43 ms
Replacing D[i][j]
by D[i,j]
greatly improved time, but still slower. (Thanks @Learning is a mess !)
total: 2.64 ms
I tested with even larger DNA sequences (around 10.000 letters each) and still lists are winning.
Can someone help me improve timing?
Are lists better for this use?
CodePudding user response:
One way to have faster run is to use GPU/TPU-aided accelerators such as numba and …. I have tested your codes by that a
and b
on google colab TPU without using accelerators:
1000 loops, best of 5: 563 µs per loop
1000 loops, best of 5: 1.95 ms per loop # NumPy
But with using numba as nopython=True
, without any changes to your codes:
import numba as nb
@nb.njit()
def edirDistance(x, y):
.
.
@nb.njit()
def NP_edirDistance(x, y):
.
.
It gets:
1000 loops, best of 5: 213 µs per loop
1000 loops, best of 5: 153 µs per loop # NumPy
Which will get significant difference between them using huge samples or by improving and vectorizing your NumPy codes. This method results as below for samples with 10000 length:
35.50053691864014
22.95994758605957 # NumPy (seconds)