Hello I am learning how to use the Scikit-learn clustering modules right now. I have a working script that reads in a pandas dataframe.
df=pd.read_csv("test.csv",index_col="identifier")
I converted the dataframe to a numpy array
array=df.to_numpy()
Then implemented the clustering and plotted as so:
km=KMeans(n_clusters=25,init="random",n_init=100,max_iter=1000,tol=1e-04, random_state=0)
##get cluster labels
y_km=km.fit_predict(array)
###To plot use PCA function
pca=PCA(n_components=3)
pca_t=pca.fit_transform(array)
####
u_labels=np.unique(y_km)
fig = plt.figure(figsize=(14,10))
ax = plt.axes(projection='3d')
for i in u_labels:
ax.scatter3D(pca_t[y_km == i , 0] , pca_t[y_km == i , 1],pca_t[y_km == i , 2], label = i)
ax.legend()
This all outputs a plot that looks like this:
I want to try and get a final output that ouputs a dictionary or text file of some sort that tells me what cluster each identifier is in based on the row ids of the original array. I was having trouble figuring out how to maintain that information though. I tried seeing if I could use the pandas Dataframe.to_records() function which maintained the dtypes but couldn't figure out how to translate that to what I wanted.
CodePudding user response:
y_km
contains your labels in the same order as the rows in your pandas dataframe. example:
df = pd.DataFrame({
'foo': ['one', 'one', 'one', 'two', 'two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
},
index = ['x', 'y', 'z', 'q', 'w', 't']
)
y_km = [1, 2, 3, 4, 5, 6]
print(pd.DataFrame(y_km, df.index))
0
x 1
y 2
z 3
q 4
w 5
t 6
CodePudding user response:
You should try :
print(y_km.labels_)
This should give you a list of label for each point.
See the documentation for KMeans.