Is my Implementation of K-means with Manually Set Centroids Correct?

I am trying to solve a clustering problem using the K-means algorithm.

Question: Using the data in the file linked below, run the K-means algorithm with the initial centroids positioned at [1,1,1,1], [-1,-1,-1,-1] and [1,-1,1,-1]. What is the position of each centroid after 10 iterations?

My solution, which I am not sure about (basic code):

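import numpy as np
from sklearn.cluster import KMeans

# Start from the three centroids given in the question instead of a random init.
init_centroids = np.array([[1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]], np.float64)
kmeans = KMeans(n_clusters=3, max_iter=10, init=init_centroids, random_state=42)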
...
kmeans.cluster_centers_

Answer:

array([[ 1.02575735, -0.00207592, -0.02395886,  0.63623732],
       [ 0.10361404,  0.00370027,  0.00669603, -0.03432606],
       [ 0.99690983,  0.48052607,  0.94034839, -0.00726928]])
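
For reference, here is a minimal sketch of the same kind of call on made-up data, assuming scikit-learn is installed. The array `X` is synthetic and only stands in for the real file, so the printed numbers will not match the answer above. `n_init=1` is passed explicitly because the starting centroids are fixed and a single run is enough.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data set (an assumption for illustration):
# three 4-D blobs scattered around the three starting points.
rng = np.random.default_rng(0)
init = np.array([[1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]], dtype=np.float64)
X = np.vstack([c + 0.3 * rng.standard_normal((50, 4)) for c in init])

# The starting positions come from `init`; n_init=1 asks for a single run.
kmeans = KMeans(n_clusters=3, init=init, n_init=1, max_iter=10)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # final centroid positions, shape (3, 4)
print(kmeans.n_iter_)           # iterations actually run (at most max_iter)
```

If `n_iter_` comes out below 10, the run stopped early under the convergence check, and more iterations would move the centroids little, if at all.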

Data: https://drive.google.com/file/d/1DXlFR3Jc5cFiblMxD6Bl7f4p7u_qsX2S/view?usp=sharing

Google Colab full code: https://colab.research.google.com/drive/1somvP3p7KES0NtBwnLYT6vpqSr3WQfgU?usp=sharing

Note: I think StackOverflow is the best place for this, but if anyone knows of a data science community where I can ask this kind of question, let me know in the comments.

CodePudding user response:

I checked your answer against my own implementation of K-means, and it is correct.

import pandas as pd
import numpy as np

df = pd.read_csv('agrupamento_Q1.csv')

data = df.to_numpy()
centeroids = np.array([[1.0,1.0,1.0,1.0],[-1.0,-1.0,-1.0,-1.0],[1.0,-1.0,1.0,-1.0]])

iterations = 10

for itr in range(iterations):
    # Assignment step: label each point with the index of its nearest centroid.
    assign = np.zeros([data.shape[0],], dtype=int)
    for i in range(data.shape[0]):
        for c in range(1, centeroids.shape[0]):
            if np.linalg.norm(data[i]-centeroids[c]) < np.linalg.norm(data[i]-centeroids[assign[i]]):
                assign[i] = c

    # Update step: accumulate the per-cluster sums and point counts.
    new_cent = np.zeros_like(centeroids)
    cent_pop = np.zeros([centeroids.shape[0],])

    for i in range(data.shape[0]):
        new_cent[assign[i]] += data[i]
        cent_pop[assign[i]] += 1

    # Move each centroid to the mean of the points assigned to it.
    for i in range(centeroids.shape[0]):
        centeroids[i] = new_cent[i]/cent_pop[i]

print(centeroids) 
# [[ 1.02575735 -0.00207592 -0.02395886  0.63623732]
#     [ 0.10361404  0.00370027  0.00669603 -0.03432606]
#     [ 0.99690983  0.48052607  0.94034839 -0.00726928]]
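
The same two steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) can also be written without the per-point Python loops. This is only a sketch: `kmeans_fixed_init` is a name made up here, the data is a synthetic stand-in for the CSV file, and leaving an empty cluster's centroid where it is is a choice made for this example, not something the question specifies.

```python
import numpy as np

def kmeans_fixed_init(data, centroids, iterations=10):
    """Run a fixed number of Lloyd iterations from the given starting centroids."""
    centroids = np.array(centroids, dtype=float)  # work on a copy
    for _ in range(iterations):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(len(centroids)):
            members = data[assign == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    # `assign` holds the labels that were used for the last update.
    return centroids, assign

# Synthetic stand-in data (an assumption): three 4-D blobs around the start points.
rng = np.random.default_rng(0)
start = np.array([[1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]], dtype=float)
data = np.vstack([c + 0.3 * rng.standard_normal((50, 4)) for c in start])

centroids, assign = kmeans_fixed_init(data, start)
print(centroids)
```

To run it on the real file, pass `df.to_numpy()` as `data` instead of the synthetic array.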