Calculating similarity using the Jaccard index in Python

Time:03-15

I want to use the Jaccard index to find the similarity between the rows of a DataFrame (df_choices, built from user_choices).

import scipy.spatial
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

user_choices = [[1, 0, 0, 1, 0, 1], 
                [0, 1, 0, 0, 0, 0], 
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
df_choices = pd.DataFrame(user_choices, columns=["User A", "User B", "User C", "User D", "User E", "User F"], 
                          index=(["User A", "User B", "User C", "User D", "User E", "User F"]))

df_choices


I wrote this code to calculate a Jaccard Index for my data:

jaccard = (1-scipy.spatial.distance.cdist(df_choices, df_choices,  
                                       metric='jaccard'))
user_distance = pd.DataFrame(jaccard, columns=df_choices.index.values,  
                             index=df_choices.index.values)

user_distance

But the output is identical to my input data!

I want to calculate the Jaccard similarity (a measure of similarity), not the Jaccard distance (a measure of dissimilarity), for each pair of users. For example, the similarity of User A with itself should be 1. Any suggestion is appreciated.

CodePudding user response:

If I understand correctly, you want user_distance[i, j] = 1 - jaccard-distance(row i, row j) for every pair of rows of df_choices.

You can get this in two steps: (1) compute the pairwise distances with pdist, which returns a condensed vector holding one distance per pair of rows (i < j); (2) expand that condensed vector into a square matrix with squareform.

jaccard = scipy.spatial.distance.pdist(df_choices, 'jaccard')
user_distances = pd.DataFrame(1-scipy.spatial.distance.squareform(jaccard), 
                              columns=df_choices.index.values,  
                              index=df_choices.index.values)

The resulting matrix is expected to be symmetric, as any distance matrix is; your input matrix happens to be symmetric as well.

For any pair of rows in your matrix, the positions where at least one row is nonzero are either all equal or all different, so the output matrix will contain only ones and zeros.
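As a quick check of the point above, here is a minimal sketch that rebuilds the question's matrix as a plain NumPy boolean array (the names choices, distance and similarity are only illustrative) and confirms that 1 minus the Jaccard distance contains only zeros and ones and coincides with the input:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Same 0/1 matrix as in the question, cast to bool so each row acts as a set.
choices = np.array([[1, 0, 0, 1, 0, 1],
                    [0, 1, 0, 0, 0, 0],
                    [0, 0, 1, 0, 0, 0],
                    [1, 0, 0, 1, 0, 1],
                    [0, 0, 0, 0, 1, 0],
                    [1, 0, 0, 1, 0, 1]], dtype=bool)

# Condensed distances (one per pair i < j), expanded to a square matrix.
distance = squareform(pdist(choices, metric="jaccard"))
similarity = 1.0 - distance

# Check that every entry is 0 or 1 ...
only_zeros_and_ones = bool(np.all(np.isin(similarity, [0.0, 1.0])))
# ... and that the similarity matrix reproduces the input.
same_as_input = bool(np.allclose(similarity, choices.astype(float)))
print(only_zeros_and_ones, same_as_input)
```

Both flags should come out True for this particular input, which is why the result looks like a copy of the data.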

If you try the same code with the following example:

user_choices = [[1, 0, 0, 3, 0, 4], 
                [0, 1, 0, 0, 0, 0], 
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]

You will get output that differs from the input.
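The Jaccard index is usually defined on sets, so a variation that stays with 0/1 data may be easier to interpret. The rows below are hypothetical (they are not from the question) and overlap only partially, so the similarities come out fractional rather than 0 or 1. This is a sketch, not the answerer's own example:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical binary rows with partial overlap (not taken from the question).
rows = np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0],
                 [1, 1, 1, 0]], dtype=bool)

# Jaccard similarity = 1 - Jaccard distance = |intersection| / |union|.
similarity = 1.0 - cdist(rows, rows, metric="jaccard")
print(np.round(similarity, 3))
```

Rows 0 and 1 share one position out of three that are set in either row, so their similarity is 1/3; each of them shares two out of three with row 2, giving 2/3.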

CodePudding user response:

  • The Jaccard distance from, e.g., User F with row vector (1, 0, 0, 1, 0, 1) to User A is zero, so 1 - scipy.spatial.distance.cdist(...) gives 1.

  • The Jaccard distance from, e.g., User E with row vector (0, 0, 0, 0, 1, 0) to User A is one, so you compute 1 - 1 = 0.

>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[5]))
0.0
>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[4]))
1.0

You have perhaps accidentally arrived at an input that is identical to one minus its own Jaccard distance matrix.

Maybe you don't want that (1-...) there?
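The two cases above can also be checked by hand without SciPy. The helper jaccard_similarity below is a hypothetical name introduced for illustration; it treats each 0/1 row as the set of positions that are switched on and returns the size of the intersection divided by the size of the union:

```python
def jaccard_similarity(u, v):
    """Jaccard similarity of two 0/1 vectors, viewed as sets of 'on' positions."""
    set_u = {i for i, x in enumerate(u) if x}
    set_v = {i for i, x in enumerate(v) if x}
    union = set_u | set_v
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(set_u & set_v) / len(union)

user_a = [1, 0, 0, 1, 0, 1]
user_e = [0, 0, 0, 0, 1, 0]
user_f = [1, 0, 0, 1, 0, 1]

sim_af = jaccard_similarity(user_a, user_f)  # identical rows
sim_ae = jaccard_similarity(user_a, user_e)  # no shared positions
print(sim_af, sim_ae)  # 1.0 0.0
```

These match 1 minus the distances printed above (1 - 0.0 and 1 - 1.0), so the similarity values themselves are consistent with the definition.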
