Home > Back-end >  Any alternate approaches to calculate co-occurrence matrix in Python?
Any alternate approaches to calculate co-occurrence matrix in Python?

Time:11-13

I'm trying to calculate the co-occurrence matrix for a large corpus but it takes a very long time( 6hours). Are there any faster ways?

My approach:

consider this array as the corpus and each element of the corpus as context:

corpus = [
    'where python is used',
    'what is python used in',
    'why python is best',
    'what companies use python'
]

Algorithm:

words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    for i in range(len(context)):
        for j in range(i   1, len(context)):
            row = words.index(context[i])
            column = words.index(context[j])
            c_matrix[row][column]  = 1

CodePudding user response:

The provided algorithm is not efficient because it recompute words.index(...) a lot of time. You can pre-compute the indices first and then build the matrix. Here is a significantly better solution:

words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    index = [words.index(item) for item in context]
    for i in range(len(context)):
        for j in range(i   1, len(context)):
            c_matrix[index[i]][index[j]]  = 1

Moreover, you can transform index to a Numpy array and use Numba (or Cython) to build the c_matrix very quickly from index.

Finally, you can transform words to a dictionary (with the string in the current list as the dictionary keys and the index in the current list as the dictionary values) so that indexing will be much faster (constant-time fetch).

The resulting algorithm should be several order of magnitude faster. If this is not enough, then you probably need to replace the matrix c_matrix with a more advanced (but also much more complex) sparse data-structure regarding your needs.

  • Related