Reduce the time complexity while populating data into a matrix python


I have a core_df dataframe. Each row has to be compared against every other row and the result stored in a matrix. The conditions: if the start or end elements do not match, or if the length ratio is less than 0.2, populate 0; otherwise populate some value (say 100). The problem is that the dataframe can be huge, potentially millions of records. What I have done to save memory so far:

  • Initialise a sparse DOK matrix
  • Populate values only in the upper triangle of the matrix (the lower triangle is its mirror image)
  • Change the datatype of the matrix so that it consumes fewer bytes. Any other suggestions to improve memory usage would be welcome.

I have tried the code below. Although it returns the right result, it takes a huge amount of time on a large dataset. Is there a way to optimise the time complexity?

import numpy as np
import pandas as pd
from scipy.sparse import dok_matrix
core_df = pd.DataFrame({'sno':[1,2,3,4],'start':[1,3,5,5],'end':[5,14,17,27],'start_elements':[('A', 'B'),('X','Y'),('B', 'C'),('B', 'C')],'end_elements':[('L', 'M'),('S', 'T'),('N', 'P'),('N', 'P')]})
 
core_df_count = len(core_df)
# a small integer dtype is enough for 0/100 scores and saves memory
Score_matrix = dok_matrix((core_df_count, core_df_count), dtype=np.uint8)

for i in range(0, core_df_count):
    start1 = core_df.iloc[i, 1]
    end1 = core_df.iloc[i, 2]
    Start_elements1 = core_df.iloc[i, 3]
    End_elements1 = core_df.iloc[i, 4]

    for j in range(i + 1, core_df_count):  # upper triangle only
        start2 = core_df.iloc[j, 1]
        end2 = core_df.iloc[j, 2]
        Start_elements2 = core_df.iloc[j, 3]
        End_elements2 = core_df.iloc[j, 4]

        if Start_elements1 != Start_elements2 or End_elements1 != End_elements2:
            Score_matrix[i, j] = 0

        else:
            ratio = (end1 - start1) / (end2 - start2)
            if ratio < 0.2:
                Score_matrix[i, j] = 0
            else:
                Score_matrix[i, j] = 100
Score_matrix_2D = pd.DataFrame(Score_matrix.todense())
Score_matrix_2D 
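
A minimal first step, before restructuring the algorithm: most of the time in the loop above goes into the per-element .iloc calls, so reading each column into a plain array or list once already helps, and skipping the explicit 0 assignments keeps the matrix truly sparse. A sketch with the same pairwise logic:

# Sketch: read from plain arrays/lists instead of calling .iloc on every
# iteration, and store only the non-zero scores.
starts = core_df['start'].to_numpy()
ends = core_df['end'].to_numpy()
start_elements = core_df['start_elements'].to_list()
end_elements = core_df['end_elements'].to_list()
Score_matrix = dok_matrix((core_df_count, core_df_count), dtype=np.uint8)

for i in range(core_df_count):
    for j in range(i + 1, core_df_count):
        if (start_elements[i] == start_elements[j]
                and end_elements[i] == end_elements[j]
                and (ends[i] - starts[i]) / (ends[j] - starts[j]) >= 0.2):
            Score_matrix[i, j] = 100  # zeros are implicit in a sparse matrix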

CodePudding user response:

  1. First I group the df by start_elements & end_elements and compare rows only within each group.
  2. In each group I compute the lengths and build a dense matrix of length ratios.
  3. I find the df indices where the ratio is >= 0.2.
  4. I do not assign 0 to the sparse matrix, only 100 where necessary; 0 is there everywhere by default.
import numpy as np
import pandas as pd
from scipy.sparse import dok_matrix

core_df = pd.DataFrame({'sno': [1, 2, 3, 4], 'start': [1, 3, 5, 5], 'end': [5, 14, 17, 27],
                        'start_elements': [('A', 'B'), ('X', 'Y'), ('B', 'C'), ('B', 'C')],
                        'end_elements': [('L', 'M'), ('S', 'T'), ('N', 'P'), ('N', 'P')]})

core_df_count = len(core_df)
Score_matrix = dok_matrix((core_df_count, core_df_count))

for (start_elements, end_elements), group_df in core_df.groupby(['start_elements', 'end_elements']):
    lengths = (group_df['end'] - group_df['start']).values
    ratios = np.divide.outer(lengths, lengths)
    # the diagonal (self-comparison) should not be counted, so zero it out
    np.fill_diagonal(ratios, 0)
    where_is_above_threshold = np.where(ratios >= 0.2)
    row_indices = group_df.index[where_is_above_threshold[0]]
    column_indices = group_df.index[where_is_above_threshold[1]]
    Score_matrix[row_indices, column_indices] = 100

Score_matrix_2D = pd.DataFrame(Score_matrix.todense())
print(Score_matrix_2D)
     0    1      2      3
0  0.0  0.0    0.0    0.0
1  0.0  0.0    0.0    0.0
2  0.0  0.0    0.0  100.0
3  0.0  0.0  100.0    0.0
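
A possible further refinement (an assumption on my part, not something the answer above does): assigning with index arrays into a dok matrix still happens essentially item by item, so for very large inputs the qualifying (row, column) pairs can be collected across groups and the sparse matrix built in one shot with coo_matrix:

import numpy as np
from scipy.sparse import coo_matrix

rows, cols = [], []
for _, group_df in core_df.groupby(['start_elements', 'end_elements']):
    lengths = (group_df['end'] - group_df['start']).to_numpy()
    ratios = np.divide.outer(lengths, lengths)
    np.fill_diagonal(ratios, 0)                # exclude self-comparisons
    r, c = np.where(ratios >= 0.2)
    rows.append(group_df.index.to_numpy()[r])  # map back to core_df indices
    cols.append(group_df.index.to_numpy()[c])

n = len(core_df)
rows = np.concatenate(rows) if rows else np.array([], dtype=int)
cols = np.concatenate(cols) if cols else np.array([], dtype=int)
Score_matrix = coo_matrix((np.full(len(rows), 100, dtype=np.uint8), (rows, cols)),
                          shape=(n, n))

coo_matrix is built for exactly this kind of bulk construction and converts cheaply to csr/csc if arithmetic on the matrix is needed later.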