I want to use the Jaccard index to measure the similarity between the rows of my dataframe (built from user_choices).
import scipy.spatial
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
user_choices = [[1, 0, 0, 1, 0, 1],
[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 1, 0],
[1, 0, 0, 1, 0, 1]]
df_choices = pd.DataFrame(user_choices, columns=["User A", "User B", "User C", "User D", "User E", "User F"],
index=(["User A", "User B", "User C", "User D", "User E", "User F"]))
df_choices
I wrote this code to calculate a Jaccard Index for my data:
jaccard = (1-scipy.spatial.distance.cdist(df_choices, df_choices,
metric='jaccard'))
user_distance = pd.DataFrame(jaccard, columns=df_choices.index.values,
index=df_choices.index.values)
user_distance
But the output is identical to my input data!
I want to calculate the Jaccard similarity (a measure of similarity), not the Jaccard distance (a measure of dissimilarity), for each pair of users. For example, the value for user A and itself should be 1, and so on. Any suggestion is appreciated.
CodePudding user response:
If I understand correctly, you want user_distance[i, j] = 1 - jaccard_distance(df_choices[i], df_choices[j]).
You can get this in two steps: (1) compute the pairwise distances with pdist, which returns a condensed vector holding one distance per unordered pair of rows; (2) expand that condensed vector into a square matrix with squareform.
jaccard = scipy.spatial.distance.pdist(df_choices, 'jaccard')
user_distances = pd.DataFrame(1-scipy.spatial.distance.squareform(jaccard),
columns=df_choices.index.values,
index=df_choices.index.values)
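As a quick sanity check, the sketch below rebuilds the question's data and compares the pdist/squareform route with the cdist call from the question. The names `users`, `values`, `sim_cdist`, `sim_pdist` and `user_similarity` are illustrative, and casting to bool is an assumption made here so that the Jaccard metric sees plain 0/1 membership data.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, pdist, squareform

user_choices = [[1, 0, 0, 1, 0, 1],
                [0, 1, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
users = ["User A", "User B", "User C", "User D", "User E", "User F"]
df_choices = pd.DataFrame(user_choices, columns=users, index=users)

# Assumption: treat the 0/1 entries as booleans before computing Jaccard.
values = df_choices.to_numpy(dtype=bool)

# Route from the question: full pairwise distance matrix in one call.
sim_cdist = 1 - cdist(values, values, metric="jaccard")

# Route from this answer: condensed distances, then expand to a square matrix.
# squareform puts zeros on the diagonal, so the similarity diagonal is 1.
sim_pdist = 1 - squareform(pdist(values, metric="jaccard"))

user_similarity = pd.DataFrame(sim_pdist, index=users, columns=users)
print(user_similarity)
```

Both routes should give the same similarity matrix, with ones on the diagonal.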
Your input matrix is symmetric, and a distance matrix is always symmetric, so the two can coincide.
For any pair of rows in your matrix, the nonzero entries are either in exactly the same positions or share no position at all, so the output matrix contains only ones and zeros.
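To make the "only ones and zeros" point concrete, here is a minimal sketch that computes the Jaccard similarity by hand as |A ∩ B| / |A ∪ B|. The helper `jaccard_similarity` is hypothetical (it is not a SciPy function), and returning 1.0 for two all-zero rows is a convention assumed here.

```python
import numpy as np

user_choices = np.array([[1, 0, 0, 1, 0, 1],
                         [0, 1, 0, 0, 0, 0],
                         [0, 0, 1, 0, 0, 0],
                         [1, 0, 0, 1, 0, 1],
                         [0, 0, 0, 0, 1, 0],
                         [1, 0, 0, 1, 0, 1]], dtype=bool)


def jaccard_similarity(u, v):
    """Hypothetical helper: size of intersection divided by size of union."""
    union = np.logical_or(u, v).sum()
    if union == 0:
        # Assumption: two empty rows are treated as identical.
        return 1.0
    return np.logical_and(u, v).sum() / union


n = len(user_choices)
similarity = np.array([[jaccard_similarity(user_choices[i], user_choices[j])
                        for j in range(n)]
                       for i in range(n)])

# Rows A, D and F are identical; every other pair of rows is disjoint,
# so each entry is either 1 (same set) or 0 (no overlap).
print(similarity)
```

For this particular input, the resulting similarity matrix has the same entries as the input itself, which is why the output looked unchanged.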
If you run the pdist/squareform code above on the following example:
user_choices = [[1, 0, 0, 3, 0, 4],
[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 1, 0],
[1, 0, 0, 1, 0, 1]]
you will get output that differs from the input.
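The same effect shows up with strictly 0/1 data as soon as rows overlap only partially. The small dataset below is made up for illustration and is not taken from the question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: rows 0 and 1 share one item, rows 0 and 2 share one item,
# rows 1 and 2 share nothing.
choices = np.array([[1, 1, 0],
                    [1, 0, 0],
                    [0, 1, 1]], dtype=bool)

similarity = 1 - squareform(pdist(choices, metric="jaccard"))

# Row 0 vs row 1: intersection 1, union 2 -> 1/2
# Row 0 vs row 2: intersection 1, union 3 -> 1/3
# Row 1 vs row 2: intersection 0, union 3 -> 0
print(similarity)
```

Here the similarity matrix contains fractional values, so it no longer looks like the input.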
CodePudding user response:
The Jaccard distance from, e.g., user F with row vector (1, 0, 0, 1, 0, 1) to user A is zero, so you compute 1 - scipy.spatial.distance.cdist(...) = 1.
The Jaccard distance from, e.g., user E with row vector (0, 0, 0, 0, 1, 0) to user A is one, so you compute 1 - 1 = 0.
>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[5]))
0.0
>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[4]))
1.0
You have perhaps accidentally arrived at an input that is identical to one minus its own Jaccard distance matrix.
Maybe you don't want that (1-...) there?