How to optimize my code to calculate Euclidean distance

I am trying to find the Euclidean distance between two points. I have around 13,000 rows in a DataFrame. I have to find the Euclidean distance of each row against all 13,000 rows and then get the similarity scores from that. Running the code takes a very long time (more than 24 hours).

Below is my code. I am a beginner in this domain.

# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)

# 'i' iterates over all ids in the dataframe
# The length of df_distance is 13000

for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])

    # To avoid duplicates, we assign "i" to "index" each time, so that the
    # comparison starts from row "i" itself.
    if i < len(df_distance):
        index = i

    # This loop compares one id against all 13000 ids. It is time consuming because every id is compared with every other id.
    for j in (range(len(df_distance))):

        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]        

        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]

        value = euclidean_dist(a,b)

        # Create a temp dictionary to load the data into dataframe
        dict = {
            'id': df_distance['id'][i], 
            'id_match': df_distance['id'][index], 
            'similarity_distance':value
        }


        df_50 = df_50.append(dict,ignore_index=True)

        # If "index" has reached the end of the dataframe, reset it to 0
        # so that the comparison of "b" with "a" wraps around and continues.
        if index == len(df_distance)-1:
            index = 0
        else:
            index += 1

    # Append the content of "df_50" into "df_similar" once for the iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)

I suspect most of the time is being spent in the for loops.

This is the Euclidean distance function I am using in my code:

from sklearn.metrics.pairwise import euclidean_distances

def euclidean_dist(a, b):
    # euclidean_distances returns a 2x2 matrix; [0][1] is the a-to-b distance
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value

Sample df_distance data. Note: in the image, the values from the locality column through the last column are scaled, and only these values are used to calculate the distance.

CodePudding user response：

Use cdist. scipy.spatial.distance.cdist is the most intuitive built-in function for this, and it is much faster than computing the distances one pair at a time in a Python loop.

In the following code, v is your array of points, and you want the distance from each point to every other point in v. D gives you exactly that.

import time
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

tic = time.time()
np.random.seed(0)
v = np.random.rand(6, 2)  # v is a random set of 6 (x, y) points
D = cdist(v, v)  # D[i, j] is the distance between point i and point j
toc = time.time()
print(toc - tic)

Converting D into a DataFrame with pd.DataFrame(D) gives you this:

          0         1         2         3         4         5
0  0.000000  0.178647  0.143061  0.208694  0.531184  0.306124
1  0.178647  0.000000  0.205629  0.384208  0.395363  0.189637
2  0.143061  0.205629  0.000000  0.246273  0.600408  0.386218
3  0.208694  0.384208  0.246273  0.000000  0.731544  0.507044
4  0.531184  0.395363  0.600408  0.731544  0.000000  0.225209
5  0.306124  0.189637  0.386218  0.507044  0.225209  0.000000

Note that the distance from a point to itself is zero, which is why the diagonal is 0.
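To connect this back to the question, here is a minimal sketch of how the same call might replace the nested loops. The small df_distance below is a made-up stand-in (the column names name, locality and price are assumptions, not the real data); the only things taken from the question are that the feature columns start at position 2 and that the result has the columns id, id_match and similarity_distance.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical stand-in for df_distance: two leading columns, then the
# scaled feature columns starting at position 2, as in the question.
df_distance = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["a", "b", "c"],
    "locality": [0.0, 3.0, 0.0],
    "price": [0.0, 4.0, 1.0],
})

# One call computes every row-against-row distance.
features = df_distance.iloc[:, 2:].to_numpy(dtype=float)
dist_matrix = cdist(features, features)  # shape (n, n)

# Flatten the matrix into the long format the question builds row by row.
ids = df_distance["id"].to_numpy()
n = len(ids)
df_similar = pd.DataFrame({
    "id": np.repeat(ids, n),       # 101, 101, 101, 102, ...
    "id_match": np.tile(ids, n),   # 101, 102, 103, 101, ...
    "similarity_distance": dist_matrix.ravel(),
})
print(df_similar)
```

Keep in mind that for 13,000 rows the full matrix has 169 million entries, about 1.35 GB as float64, so the long-format table is only worth building if you really need every pair.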

You could also use pdist and squareform, which give the same distance matrix.

from scipy.spatial.distance import pdist, squareform
D_cond = pdist(v)
D1 = squareform(D_cond)

Both D and D1 hold the distances in the following format:

        first    second   third
first    d11      d12     d13
second   d21      d22     d23
third    d31      d32     d33

Here d11, d22 and d33 will be zero.
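Since the matrix is symmetric (d12 equals d21), you may only want each pair once. A possible sketch, using the condensed output of pdist directly instead of squareform: np.triu_indices with k=1 yields the (i, j) index pairs with i < j in the same order pdist uses. The ids and feature values here are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Hypothetical ids and scaled feature rows, for illustration only.
ids = np.array([101, 102, 103, 104])
features = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

# Condensed distances: one value per unordered pair, n * (n - 1) / 2 in total.
condensed = pdist(features)

# Row and column indices of the upper triangle (i < j), in pdist order.
rows, cols = np.triu_indices(len(ids), k=1)

df_pairs = pd.DataFrame({
    "id": ids[rows],
    "id_match": ids[cols],
    "similarity_distance": condensed,
})
print(df_pairs)
```

This stores roughly half as many values as the full matrix and leaves out the zero diagonal.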

CodePudding user response：

Try using numpy instead. Do something like this:

import pandas as pd
import numpy as np 

def numpy_euclidian_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance 
    
    
# Initialise the data as a dict of lists
data = {'num1':[1, 2, 3, 4], 'num2':[20, 21, 19, 18]}

# Create the DataFrame
df = pd.DataFrame(data)

# Calculate the distance for the whole columns at once using numpy.
# The two columns are treated as two vectors, so this returns a single number.
distance = numpy_euclidian_distance(df.iloc[:,0],df.iloc[:,1])
print(distance)
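The function above returns one distance between two vectors. If you want every row against every other row using only numpy, one option is broadcasting. This is a rough sketch rather than a tuned solution, and pairwise_euclidean is a name made up for this example:

```python
import numpy as np

def pairwise_euclidean(points):
    """Return an (n, n) matrix of Euclidean distances between the rows of points."""
    points = np.asarray(points, dtype=float)
    # Broadcasting (n, 1, d) against (1, n, d) gives all row differences: (n, n, d).
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Three example points; each row is one point.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(points)
print(D)
```

Be aware that the intermediate diff array has shape (n, n, d), so with 13,000 rows and several feature columns it can need many gigabytes of memory. In that case you would have to process the rows in chunks.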