I'm working on document clustering where I first build a distance matrix from the tf-idf results. I use the code below to get my tf-idf matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(models)
This results in a matrix of shape (9069, 22210). Now I want to build a (9069 x 9069) distance matrix from it. I'm using the following code for that:
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from scipy.spatial import distance
arrX = X.toarray()
rowSize = X.shape[0]
distMatrix = np.zeros(shape=(rowSize, rowSize))
# Build the distance matrix
for i, x in enumerate(arrX):
    for j, y in enumerate(arrX):
        distMatrix[i][j] = distance.braycurtis(x, y)
np.savetxt("dist.csv", distMatrix, delimiter=",")
The problem with this code is that it's extremely slow for this matrix size. Is there a faster way of doing this?
CodePudding user response:
The biggest issue is that the algorithm runs in O(n^2 * m) time: each call to distance.braycurtis requires computing two temporary arrays of size 22210 (the number of tf-idf features), and this is done 9069 * 9069 times. That is roughly 9069 * 9069 * 22210 ≈ 1.8e12 scalar operations, which is huge. The complexity of the algorithm itself probably cannot be improved, but there are several ways to speed this up:
The first thing to do is not to compute each distance twice. The Bray-Curtis distance is symmetric, so distMatrix[i][j] == distMatrix[j][i]: you can compute only the upper triangular part and then copy it to the lower triangular part (see the short sketch below).
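As an illustration of that idea, SciPy already provides pdist, which evaluates each pair only once, and squareform, which mirrors the condensed result into the full symmetric matrix. This is only a sketch of that alternative (it reuses arrX from the question and is still single-threaded, so it is not as fast as the Numba version below):
from scipy.spatial.distance import pdist, squareform
# pdist computes braycurtis once per (i, j) pair with i < j (condensed form);
# squareform expands the condensed vector into the full (n, n) symmetric matrix.
condensed = pdist(arrX, metric='braycurtis')
distMatrix = squareform(condensed)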
Another optimization is simply not to use distance.braycurtis at all, because it is slow: it takes about 10 us/call on my machine. This is mainly because it creates several temporary arrays, is mostly memory-bound because of the Numpy operations, and also because np.sum is not very fast (mainly because it uses a pretty precise summation algorithm that is hard to optimize). Moreover, it is sequential, while nearly all mainstream processors have multiple cores nowadays. We can use Numba to massively speed up this operation:
import numba as nb
import numpy as np

@nb.njit(['float32(float32[::1], float32[::1])', 'float64(float64[::1], float64[::1])'], fastmath=True)
def fastBrayCurtis(arr1, arr2):
    assert arr1.size == arr2.size
    assert arr1.size > 0
    zero = arr1[0] * 0  # Trick to give `zero` the same dtype as the items of `arr1`
    df, sm = zero, zero
    for k in range(arr1.size):
        df += np.abs(arr1[k] - arr2[k])
        sm += np.abs(arr1[k] + arr2[k])
    return df / sm
# The signatures of the function are provided so it is compiled eagerly
# for both 32-bit and 64-bit floating-point 2D contiguous arrays.
@nb.njit(['float32[:,::1](float32[:,::1])', 'float64[:,::1](float64[:,::1])'], fastmath=True, parallel=True)
def brayCurtisDistMatrix(arr):
    n = arr.shape[0]
    distance = np.empty((n, n), dtype=arr.dtype)
    # Compute the distance matrix in parallel while balancing the work between threads:
    # each iteration fills one long row and one short row of the upper triangular part
    # and mirrors the values into the lower triangular part.
    for i in nb.prange((n + 1) // 2):
        # Top of the upper triangular part (many items)
        for j in range(i, n):
            distance[j, i] = distance[i, j] = fastBrayCurtis(arr[i], arr[j])
        # Bottom of the upper triangular part (few items)
        for j in range(n - 1 - i, n):
            distance[j, n - 1 - i] = distance[n - 1 - i, j] = fastBrayCurtis(arr[n - 1 - i], arr[j])
    return distance
This code is about 440 times faster than the initial one on my 6-core i5-9600KF processor. Actually, a quick theoretical analysis combined with profiling results shows that the algorithm is close to optimal (>75% of the computing power of my processor is used)! If this is not enough, you should consider using the single-precision implementation. If this is still not enough, you should then also consider writing an optimized GPU code for that (or simply reconsider the need to compute such a huge distance matrix).
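For completeness, here is how the function above could be called on the tf-idf matrix from the question (just a sketch; X is assumed to be the sparse matrix returned by fit_transform, and casting to float32 selects the single-precision, faster path):
arrX = X.toarray().astype(np.float32)  # float32 halves memory traffic vs float64
distMatrix = brayCurtisDistMatrix(arrX)
np.savetxt("dist.csv", distMatrix, delimiter=",")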
CodePudding user response:
The individual elements of the NumPy multidimensional array you give as input can be stored in memory in two ways: row-major (C order) or column-major (Fortran order). Each has its advantages and disadvantages, and you can control which one is used. I hope you find this helpful.
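For illustration, a minimal sketch (using generic example data, not the question's matrix) showing how to inspect and control the memory order in NumPy:
import numpy as np
a = np.arange(6, dtype=np.float64).reshape(2, 3)  # row-major (C order) by default
b = np.asarray(a, order='F')                      # column-major (Fortran order) copy
print(a.flags['C_CONTIGUOUS'], b.flags['F_CONTIGUOUS'])  # True True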