Home > front end >  Sorting Algorithm according to match/mismatch
Sorting Algorithm according to match/mismatch

Time:11-15

The data I'm working on represent the presence or the absence of some species in 5 habitats. I want to obtain clusters according to the shared zones between them, basically i want to maximize the matches between elements for each species.
This is the original dataset Initial Dataset

And this is the type of sorting I'm looking for Desired Sorting
I managed to obtain 3 groups by simply sorting on the numbers of occupied zones and then manually fixing evident errors.
This very simplified algorithm only worked because occupied zones are always contiguous, and some common pattern are present.

At first I thought this problem was somewhat similar to ungrouped

grouped

CodePudding user response:

I think I solved this problem in a simple yet appropriate way (I'm sure there may be better algorithms though). I check the number of matches between each couple of boolean vectors:

n=len(diet_bool_df['Prey'])
match_mtx = [[0 for i in range(n)] for j in range(n)] #empty matrix nxn

for i in range(n):
    for j in range(n):
#this gives a vector containing 1 for a match and 0 for a mismatch position-wise
        match_vec = [int(diet_bool_df['Boolean Diet'][i][k]==diet_bool_df['Prey'][j][k]) for k in range(len(diet_bool_df['Prey'][i]))]
#the sum of all 1 gives the number of matches between each couple (i,j)
        match_mtx[i][j] = sum(match_vec)

Since there are 9 preys there will be a maximum of 9 matches, which will indicate identical diets. This new matrix now can be actually clusterized with kmeans because data are isotropic and all assumptions should be met.

def plot_match_cluster():
    fig= go.Figure()
    fig.add_trace(go.Heatmap(
        z=match_mtx,
    ))
    fig.show()

Match Matrix as a heatmap

kmeans = sklearn.cluster.KMeans(n_clusters=3).fit_predict(match_mtx)

The new clusterized data looks like this
Clusterized Data

NB. Interestingly enough the initial sorting I decided according to the number of preys is a good indicator in this scenario to create clusters. This is only true with 3 clusters though.

  • Related