I have a dataframe df that I have applied sklearn.mixture.GaussianMixture to in order to cluster my data. It is a relatively simple model:
# Pandas and Numpy
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
# Gaussian mixture clustering
from sklearn.mixture import GaussianMixture
# Define Colours and labels
colours = ['cyan', 'chartreuse']
lab = ['Segment 1', 'Segment 2',]
# Define dataset
X = df[['weights', 'percentiles']].to_numpy()
# Define the model
gm_model = GaussianMixture(n_components=2)
# Fit the model
gm_model.fit(X)
# Assign a cluster to each example
yhat = gm_model.predict(X)
# Retrieve unique clusters
clusters = np.unique(yhat)
# Create scatter plot for samples from each cluster
for i, cluster in enumerate(clusters):
    # Get row indexes for samples with this cluster
    row_ix = np.where(yhat == cluster)
    # Create scatter of these samples with a different colour and label for each segment
    plt.scatter(X[row_ix, 0], X[row_ix, 1], s=1, c=colours[i], label=lab[i])
lgnd = plt.legend(loc='lower right', scatterpoints=1, fontsize=30)
plt.show()
What I then want to do is take another dataframe df_1 and find which of its values fall into which cluster created from df. Both df and df_1 have the exact same structure:
print(df.columns)
Index(['id', 'percentiles', 'weights', 'is_good'],
dtype='object')
print(df.dtypes)
id object
percentiles float64
weights float64
is_good object
So I want to take the rows of df_1 where df_1['is_good'] == 'Yes' and find which of the clusters created from df they would fall into.
I was thinking of doing this by finding the coordinates of the boundaries of each cluster and then just finding all the values in df_1 that were inside those boundaries and tagging those as being within a particular cluster. In order to do that, however, I would need to know how to find the coordinates of the cluster boundaries. Or if there is another (or better) way to do this, I would love to know!
CodePudding user response:
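You don't need the cluster boundaries. A fitted GaussianMixture can assign new points to its components, so you could just predict on df_1 in the same way as performed for df: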
X = df_1[['weights', 'percentiles']].to_numpy()
prediction = gm_model.predict(X)
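If you only care about the rows where is_good is 'Yes', one option is to filter df_1 first and then attach the predicted cluster to those rows. The following is a minimal sketch rather than a definitive implementation. The synthetic data, the make_frame helper, random_state=0, and the cluster and cluster_prob column names are all assumptions made for illustration; only the column names weights, percentiles and is_good come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Synthetic stand-ins for df and df_1 (assumption: the real data is not shown)
rng = np.random.default_rng(0)

def make_frame(n):
    # Hypothetical helper: two well-separated blobs in (weights, percentiles) space
    half = n // 2
    weights = np.concatenate([rng.normal(0, 1, half), rng.normal(10, 1, n - half)])
    percentiles = np.concatenate([rng.normal(0, 1, half), rng.normal(10, 1, n - half)])
    return pd.DataFrame({
        'id': [str(i) for i in range(n)],
        'percentiles': percentiles,
        'weights': weights,
        'is_good': rng.choice(['Yes', 'No'], size=n),
    })

df = make_frame(200)
df_1 = make_frame(100)

features = ['weights', 'percentiles']

# Fit on df only, so the clusters are defined by df
gm_model = GaussianMixture(n_components=2, random_state=0)
gm_model.fit(df[features].to_numpy())

# Keep only the rows of df_1 flagged as good
good = df_1.loc[df_1['is_good'] == 'Yes'].copy()
X_new = good[features].to_numpy()

# Hard assignment: most likely component for each row
good['cluster'] = gm_model.predict(X_new)
# Soft assignment: probability of belonging to the assigned component
good['cluster_prob'] = gm_model.predict_proba(X_new).max(axis=1)

print(good[['id', 'cluster', 'cluster_prob']].head())
```

The columns must be passed in the same order used when fitting (weights, then percentiles). A Gaussian mixture has no hard-edged boundary coordinates. Each point is assigned to the component with the highest posterior probability, and predict_proba shows how confident each assignment is.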