I am trying to find the Euclidean distance between points. I have around 13,000 rows in a DataFrame. I have to compute the Euclidean distance from each row to all 13,000 rows and then get the similarity scores from that. Running the code is very time consuming (more than 24 hours).
Below is my code. I am a beginner in this domain.
import pandas as pd
from tqdm import tqdm

# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)
# 'i' refers to all ids in the dataframe
# Length of df_distance is 13000
for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])
    # To avoid duplicates, each time we assign "index" the value of "i" so that
    # the comparison starts from index "i" itself.
    if i < len(df_distance):
        index = i
    # This loop iterates one id against all 13000 ids. This is time consuming,
    # as we have to compare each id against all 13000 ids.
    for j in (range(len(df_distance))):
        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]
        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]
        value = euclidean_dist(a,b)
        # Create a temp dictionary to load the data into the dataframe
        dict = {
            'id': df_distance['id'][i],
            'id_match': df_distance['id'][index],
            'similarity_distance':value
        }
        df_50 = df_50.append(dict,ignore_index=True)
        # If "index" has reached the end of the array, reset it to 0
        # so as to continue the comparison of "b" with "a".
        if index == len(df_distance)-1:
            index = 0
        else:
            index += 1
    # Append the content of "df_50" into "df_similar" once per iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)
I guess most of the time is spent in the for loops.
This is the Euclidean distance function I am using in my code:
from sklearn.metrics.pairwise import euclidean_distances
def euclidean_dist(a, b):
    # euclidean_distances returns a 2x2 matrix; entry [0][1] is the a-to-b distance
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value
Sample df_distance data. Note: in the image, the values are scaled from the locality column through the last column, and only these values are used to calculate the distance.
CodePudding user response:
Use cdist. scipy.spatial.distance.cdist is the most intuitive built-in function for this, and far faster than looping over the rows one pair at a time.
In the following code, v is your array of points, and you want the distance from each point to every other point in v. D will give you what you want.
import time
import pandas as pd
import numpy as np
tic = time.time()
from scipy.spatial.distance import cdist
np.random.seed(0)
v = np.random.rand(6, 2) # v is random set of 6 (x,y) points
D = cdist(v, v)
toc = time.time()
print(toc-tic)
Converting D into a DataFrame using pd.DataFrame(D) will give you this:
          0         1         2         3         4         5
0  0.000000  0.178647  0.143061  0.208694  0.531184  0.306124
1  0.178647  0.000000  0.205629  0.384208  0.395363  0.189637
2  0.143061  0.205629  0.000000  0.246273  0.600408  0.386218
3  0.208694  0.384208  0.246273  0.000000  0.731544  0.507044
4  0.531184  0.395363  0.600408  0.731544  0.000000  0.225209
5  0.306124  0.189637  0.386218  0.507044  0.225209  0.000000
Note that the distance from a point to itself is zero, which is why the diagonal is 0.
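To connect this to the data in the question, here is a minimal sketch. It assumes df_distance has an id column and that the scaled feature columns start at position 2, as the question's iloc[i, 2:] suggests. The small frame below is made-up illustration data, and the column names name, locality and price are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Made-up stand-in for df_distance: two leading columns, then scaled features.
df_distance = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["a", "b", "c"],       # hypothetical non-feature column
    "locality": [0.0, 3.0, 0.0],   # hypothetical scaled feature
    "price": [0.0, 4.0, 1.0],      # hypothetical scaled feature
})

# Take only the feature columns, as the question does with iloc[i, 2:].
features = df_distance.iloc[:, 2:].to_numpy(dtype=float)
D = cdist(features, features)      # Euclidean metric by default

# Flatten the square matrix into the id / id_match / similarity_distance layout.
ids = df_distance["id"].to_numpy()
n = len(ids)
df_similar = pd.DataFrame({
    "id": np.repeat(ids, n),       # 101, 101, 101, 102, ...
    "id_match": np.tile(ids, n),   # 101, 102, 103, 101, ...
    "similarity_distance": D.ravel(),
})
print(df_similar)
```

With 13,000 rows the full matrix has 169 million entries (roughly 1.3 GB as float64), so the long-format frame may be too large to keep in memory all at once; processing it in row blocks is one option.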
You could also use pdist and squareform, and you will get the same result.
from scipy.spatial.distance import pdist, squareform
D_cond = pdist(v)
D1 = squareform(D_cond)
Both D and D1 hold the distances in the following format:
         first  second  third
first    d11    d12     d13
second   d21    d22     d23
third    d31    d32     d33
where d11, d22, and d33 will be zero.
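Since the matrix is symmetric, each pair appears twice in it. If only the unique pairs are wanted (the question mentions avoiding duplicates), the condensed output of pdist can be paired with np.triu_indices, which enumerates the upper triangle in the same order. This is a sketch with made-up ids and points, not the asker's data.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

ids = np.array([101, 102, 103])    # hypothetical ids
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [0.0, 1.0]])    # made-up feature rows

# Condensed distances: one value per unordered pair, n*(n-1)/2 in total,
# ordered (0,1), (0,2), ..., (1,2), ...
condensed = pdist(points)

# Row/column indices of the strict upper triangle, in that same order.
rows, cols = np.triu_indices(len(points), k=1)

pairs = pd.DataFrame({
    "id": ids[rows],
    "id_match": ids[cols],
    "similarity_distance": condensed,
})
print(pairs)
```

This stores about half as many values as the square matrix and leaves out the zero self-distances.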
CodePudding user response:
Try using numpy instead. Do something like this:
import pandas as pd
import numpy as np
def numpy_euclidian_distance(point_1, point_2):
    # Convert the inputs to arrays so the arithmetic is vectorized
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance
# Initialise the data as lists.
data = {'num1':[1, 2, 3, 4], 'num2':[20, 21, 19, 18]}
# Create the DataFrame
df = pd.DataFrame(data)
# Calculate the distance over all the numbers at once using numpy
# (here the two columns are treated as two vectors)
distance = numpy_euclidian_distance(df.iloc[:,0],df.iloc[:,1])
print(distance)
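The call above compares two columns as a single pair of vectors. For the row-against-row case in the question, the same numpy arithmetic can be broadcast over all pairs of rows at once. The sketch below reuses the toy data and treats each row as one point; that interpretation is an assumption about how the data is laid out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num1': [1, 2, 3, 4], 'num2': [20, 21, 19, 18]})
X = df.to_numpy(dtype=float)            # shape (4, 2): one row per point

# Broadcasting: (n, 1, d) - (1, n, d) gives every row-pair difference, shape (n, n, d).
diff = X[:, None, :] - X[None, :, :]

# Sum of squares over the feature axis, then square root: D[i, j] = ||X[i] - X[j]||.
D = np.sqrt((diff ** 2).sum(axis=-1))
print(D)
```

The intermediate diff array has n * n * d entries, so for 13,000 rows this would need to be done in blocks of rows, or replaced by a library routine that avoids materializing it.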