Reduce the time complexity while populating data into a matrix python


I have a core_df dataframe. Each row has to be compared against every other row and the result stored in a matrix. The conditions: if the start or end elements do not match, or if the length ratio is less than 0.2, populate 0; otherwise populate some value (say 100). The problem is that the dataframe can be huge, potentially millions of records. What I have done to save memory so far:

  • Initialise a sparse DOK matrix
  • Populate values only in the upper triangle of the matrix (the lower triangle is its mirror image)
  • Change the datatype of the matrix so that it consumes fewer bytes. Any other suggestions to improve memory usage would be welcome.

I have tried the code below. Although it returns the right result, it takes a huge amount of time on a large dataset. Is there a way to optimise the time complexity?

import numpy as np
import pandas as pd
from scipy.sparse import dok_matrix
core_df = pd.DataFrame({'sno':[1,2,3,4],'start':[1,3,5,5],'end':[5,14,17,27],'start_elements':[('A', 'B'),('X','Y'),('B', 'C'),('B', 'C')],'end_elements':[('L', 'M'),('S', 'T'),('N', 'P'),('N', 'P')]})
 
core_df_count = len(core_df)
# a small integer dtype is enough for 0/100 scores and saves memory
Score_matrix = dok_matrix((core_df_count, core_df_count), dtype=np.uint8)

for i in range(0, core_df_count):
    start1 = core_df.iloc[i, 1]
    end1 = core_df.iloc[i, 2]
    Start_elements1 = core_df.iloc[i, 3]
    End_elements1 = core_df.iloc[i, 4]

    for j in range(i + 1, core_df_count):  # upper triangle only
        start2 = core_df.iloc[j, 1]
        end2 = core_df.iloc[j, 2]
        Start_elements2 = core_df.iloc[j, 3]
        End_elements2 = core_df.iloc[j, 4]

        if Start_elements1 != Start_elements2 or End_elements1 != End_elements2:
            Score_matrix[i, j] = 0

        else:
            ratio = (end1 - start1) / (end2 - start2)
            if ratio < 0.2:
                Score_matrix[i, j] = 0
            else:
                Score_matrix[i, j] = 100
Score_matrix_2D = pd.DataFrame(Score_matrix.todense())
Score_matrix_2D 
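
A minimal first step, before restructuring the algorithm: most of the time in the loop above goes into the per-element .iloc calls, so reading each column into a plain array or list once already helps, and skipping the explicit 0 assignments keeps the matrix truly sparse. A sketch with the same pairwise logic:

# Sketch: read from plain arrays/lists instead of calling .iloc on every
# iteration, and store only the non-zero scores.
starts = core_df['start'].to_numpy()
ends = core_df['end'].to_numpy()
start_elements = core_df['start_elements'].to_list()
end_elements = core_df['end_elements'].to_list()
Score_matrix = dok_matrix((core_df_count, core_df_count), dtype=np.uint8)

for i in range(core_df_count):
    for j in range(i + 1, core_df_count):
        if (start_elements[i] == start_elements[j]
                and end_elements[i] == end_elements[j]
                and (ends[i] - starts[i]) / (ends[j] - starts[j]) >= 0.2):
            Score_matrix[i, j] = 100  # zeros are implicit in a sparse matrix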

CodePudding user response:

  1. First I group the df by start_elements & end_elements and compare rows only within each group.
  2. In each group I compute the lengths and build a dense matrix of length ratios.
  3. I find the df indices where the ratio is >= 0.2.
  4. I do not assign 0 to the sparse matrix, only 100 where necessary; 0 is there everywhere by default.
import numpy as np
import pandas as pd
from scipy.sparse import dok_matrix

core_df = pd.DataFrame({'sno': [1, 2, 3, 4], 'start': [1, 3, 5, 5], 'end': [5, 14, 17, 27],
                        'start_elements': [('A', 'B'), ('X', 'Y'), ('B', 'C'), ('B', 'C')],
                        'end_elements': [('L', 'M'), ('S', 'T'), ('N', 'P'), ('N', 'P')]})

core_df_count = len(core_df)
Score_matrix = dok_matrix((core_df_count, core_df_count))

for (start_elements, end_elements), group_df in core_df.groupby(['start_elements', 'end_elements']):
    lengths = (group_df['end'] - group_df['start']).values
    ratios = np.divide.outer(lengths, lengths)
    # the diagonal (self-comparison) should not be counted, so zero it out
    np.fill_diagonal(ratios, 0)
    where_is_above_threshold = np.where(ratios >= 0.2)
    row_indices = group_df.index[where_is_above_threshold[0]]
    column_indices = group_df.index[where_is_above_threshold[1]]
    Score_matrix[row_indices, column_indices] = 100

Score_matrix_2D = pd.DataFrame(Score_matrix.todense())
print(Score_matrix_2D)
     0    1      2      3
0  0.0  0.0    0.0    0.0
1  0.0  0.0    0.0    0.0
2  0.0  0.0    0.0  100.0
3  0.0  0.0  100.0    0.0
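
A possible further refinement (an assumption on my part, not something the answer above does): assigning with index arrays into a dok matrix still happens essentially item by item, so for very large inputs the qualifying (row, column) pairs can be collected across groups and the sparse matrix built in one shot with coo_matrix:

import numpy as np
from scipy.sparse import coo_matrix

rows, cols = [], []
for _, group_df in core_df.groupby(['start_elements', 'end_elements']):
    lengths = (group_df['end'] - group_df['start']).to_numpy()
    ratios = np.divide.outer(lengths, lengths)
    np.fill_diagonal(ratios, 0)                # exclude self-comparisons
    r, c = np.where(ratios >= 0.2)
    rows.append(group_df.index.to_numpy()[r])  # map back to core_df indices
    cols.append(group_df.index.to_numpy()[c])

n = len(core_df)
rows = np.concatenate(rows) if rows else np.array([], dtype=int)
cols = np.concatenate(cols) if cols else np.array([], dtype=int)
Score_matrix = coo_matrix((np.full(len(rows), 100, dtype=np.uint8), (rows, cols)),
                          shape=(n, n))

coo_matrix is built for exactly this kind of bulk construction and converts cheaply to csr/csc if arithmetic on the matrix is needed later.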