I am trying to solve a clustering problem that uses the K-means algorithm.
Question: Using the data in the file linked below, run the K-means algorithm with the initial centroids positioned at [1,1,1,1], [-1,-1,-1,-1] and [1,-1,1,-1]. What is the position of each centroid after 10 iterations?
My solution, which I am not sure about (basic code):
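import numpy as np
from sklearn.cluster import KMeans
# Start from the three centroids given in the question and cap the run at 10 iterations.
init_centroids = np.array([[1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]], np.float64)
kmeans = KMeans(n_clusters=3, max_iter=10, init=init_centroids, random_state=42)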
...
kmeans.cluster_centers_
Answer:
array([[ 1.02575735, -0.00207592, -0.02395886, 0.63623732],
[ 0.10361404, 0.00370027, 0.00669603, -0.03432606],
[ 0.99690983, 0.48052607, 0.94034839, -0.00726928]])
Data: https://drive.google.com/file/d/1DXlFR3Jc5cFiblMxD6Bl7f4p7u_qsX2S/view?usp=sharing
Google Colab full code: https://colab.research.google.com/drive/1somvP3p7KES0NtBwnLYT6vpqSr3WQfgU?usp=sharing
Note: I think StackOverflow is the best place for this, but if anyone knows of a data science community where I can ask this kind of question, let me know in the comments.
CodePudding user response:
I checked your answer with my own code, and it is correct.
import pandas as pd
import numpy as np
df = pd.read_csv('agrupamento_Q1.csv')
data = df.to_numpy()
centeroids = np.array([[1.0,1.0,1.0,1.0],[-1.0,-1.0,-1.0,-1.0],[1.0,-1.0,1.0,-1.0]])
iterations = 10
for itr in range(iterations):
    # Assignment step: label each point with the index of its nearest centroid.
    assign = np.zeros([data.shape[0],], dtype=int)
    for i in range(data.shape[0]):
        for c in range(1, 3):
            if np.linalg.norm(data[i]-centeroids[c]) < np.linalg.norm(data[i]-centeroids[assign[i]]):
                assign[i] = c
    # Update step: accumulate the sum and the count of the points in each cluster.
    new_cent = np.zeros_like(centeroids)
    cent_pop = np.zeros([centeroids.shape[0],])
    for i in range(data.shape[0]):
        new_cent[assign[i]] += data[i]
        cent_pop[assign[i]] += 1
    # Each new centroid is the mean of the points assigned to it.
    for i in range(centeroids.shape[0]):
        centeroids[i] = new_cent[i]/cent_pop[i]
print(centeroids)
# [[ 1.02575735 -0.00207592 -0.02395886 0.63623732]
# [ 0.10361404 0.00370027 0.00669603 -0.03432606]
# [ 0.99690983 0.48052607 0.94034839 -0.00726928]]
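For comparison, the same assign-then-average loop can be written more compactly with NumPy broadcasting. This is only a sketch: the function name kmeans_fixed_init is made up for illustration, and leaving an empty cluster's centroid where it was is an assumption on my part (the loop above would divide by zero in that case).

```python
import numpy as np

def kmeans_fixed_init(data, init_centroids, n_iter=10):
    """Run exactly n_iter assign/update rounds from the given starting centroids."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        # Assignment step: index of the nearest centroid for each point.
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = data[labels == k]
            # Assumption: an empty cluster keeps its previous centroid.
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    # labels come from the last assignment step, made before the final update.
    return centroids, labels
```

It would be called as kmeans_fixed_init(data, centeroids_start, n_iter=10), where centeroids_start stands for the initial array from the question. One difference from scikit-learn worth keeping in mind: max_iter there is an upper bound, so KMeans may stop earlier once it converges, whereas this loop always runs the requested number of rounds.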