Get boundary coordinates for clusters created from sklearn Gaussian Mixture

Time:04-12

I have applied sklearn.mixture.GaussianMixture to a dataframe df in order to cluster my data. It is a relatively simple model:

# Pandas and Numpy
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt

# Gaussian mixture clustering
from sklearn.mixture import GaussianMixture

# Define Colours and labels
colours = ['cyan', 'chartreuse']
lab = ['Segment 1', 'Segment 2',]

# Define dataset
X = df[['weights', 'percentiles']].to_numpy()
# Define the model
gm_model = GaussianMixture(n_components=2)
# Fit the model
gm_model.fit(X)
# Assign a cluster to each example
yhat = gm_model.predict(X)
# Retrieve unique clusters
clusters = np.unique(yhat)

# Create scatter plot for samples from each cluster
for i, cluster in enumerate(clusters):
    # Boolean mask selecting the samples assigned to this cluster
    mask = yhat == cluster
    # Create scatter of these samples with a different colour and label for each segment
    plt.scatter(X[mask, 0], X[mask, 1], s=1, c=colours[i], label=lab[i])

lgnd = plt.legend(loc='lower right', scatterpoints=1, fontsize=30)

plt.show()

What I then want to do is take another dataframe df_1 and find which of its values fall into which cluster created from df. Both df and df_1 have the exact same structure:

print(df.columns)

Index(['id', 'percentiles', 'weights', 'is_good'],
      dtype='object')

print(df.dtypes)

id                  object
percentiles        float64
weights            float64
is_good             object

So I want to take the rows of df_1 where df_1['is_good'] == 'Yes' and find which of the clusters created from df each of them would fall into.

I was thinking of doing this by finding the coordinates of the boundaries of each cluster, then finding all the values in df_1 that lie inside those boundaries and tagging them as belonging to that cluster. To do that, however, I would need to know how to find the coordinates of the cluster boundaries. If there is another (or better) way to do this, I would love to know!

CodePudding user response:

You don't need the boundary coordinates. The fitted model can assign clusters to new data, so you can call predict on df_1 in the same way as you did for df:

X = df_1[['weights', 'percentiles']].to_numpy()
prediction = gm_model.predict(X)
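
To restrict this to the rows where is_good is 'Yes' and keep the labels next to the original rows, something like the following might work. It is only a sketch: the helper name assign_clusters, the FEATURES constant, the added columns cluster and cluster_prob, and the synthetic data standing in for df and df_1 are all assumptions made for illustration, not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Same columns, in the same order, as used when fitting the model
FEATURES = ["weights", "percentiles"]


def assign_clusters(model, frame, flag_col="is_good", flag_value="Yes"):
    """Return the flagged rows of `frame` with a cluster label and its probability."""
    subset = frame.loc[frame[flag_col] == flag_value].copy()
    X_new = subset[FEATURES].to_numpy()
    subset["cluster"] = model.predict(X_new)
    # predict_proba returns one column per component; the row maximum is the
    # posterior probability of the assigned cluster
    subset["cluster_prob"] = model.predict_proba(X_new).max(axis=1)
    return subset


# Synthetic stand-ins for df and df_1 (hypothetical data: two well-separated blobs)
rng = np.random.default_rng(0)


def make_frame(n):
    half = n // 2
    centres = np.concatenate([np.zeros(half), np.full(n - half, 10.0)])
    return pd.DataFrame({
        "id": [str(i) for i in range(n)],
        "percentiles": centres + rng.normal(0, 1, n),
        "weights": centres + rng.normal(0, 1, n),
        "is_good": np.where(np.arange(n) % 2 == 0, "Yes", "No"),
    })


df = make_frame(200)
df_1 = make_frame(50)

# random_state is fixed here only so the cluster numbering is repeatable
gm_model = GaussianMixture(n_components=2, random_state=0)
gm_model.fit(df[FEATURES].to_numpy())

tagged = assign_clusters(gm_model, df_1)
print(tagged[["id", "cluster", "cluster_prob"]].head())
```

The cluster_prob column is a rough indication of how close a point is to the border between the two components: values near 0.5 sit close to it, values near 1 lie well inside one cluster. The cluster numbers themselves are arbitrary and can swap between fits unless random_state is set.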