ZeroDivisionError when finding average-CodePudding

I've been trying to write my own k-mean algorithm for the past couple of days, but I have hit a roadblock. When ever I try to find the average location of the points in a cluster to move the centroid, I get a zero division error(Note: this doesn't happen when k = 2, and only happens sometimes when k = 3, but always happens when k >= 4). I tried to fix this by making sure each centroid starts off at one of the points in the dataset, so that it always has at least one point in its cluster, but it hasn't worked. I've also rearranged counters and such, but again, it did not work. I have run out of ideas and am not sure why this error still happens. I'm pretty sure the problem is coming from one of these functions(edit: added all of code and full error message):

import random
import math
import matplotlib.pyplot as plt


class Kmeans:
    def __init__(self, K, dataset, centroids, sorting):
        self.K = K
        self.dataset = dataset
        self.centroids = centroids
        self.sorting = sorting

    def initializeCentroids(self):
        usedPoints = [random.choice(data_set)]
        self.centroids = []
        for q in range(self.K):
            pointSelected = False
            while not pointSelected:
                m = random.choice(data_set)
                print(m)
                print(usedPoints)
                distance = math.sqrt(abs(((m[0] - usedPoints[len(usedPoints) - 1][0]) ** 2)   (m[1] - usedPoints[len(usedPoints) - 1][1]) ** 2))
                if usedPoints.count(m) == 0 and distance > 50:
                    self.centroids.append(list(m))
                    usedPoints.append(m)
                    pointSelected = True
        return self.centroids

    def calcDistance(self):
        self.sorting = []
        for w in self.dataset:
            distances = []
            counter = -1
            for centr in self.centroids:
                counter  = 1
                distances.append(math.sqrt(abs((((w[0] - centr[0]) ** 2)   (w[1] - centr[1]) ** 2))))
                for x in range(len(distances)):
                    if len(distances) > 1:
                        print(distances)
                        if distances[0] > distances[1]:
                            distances.pop(0)
                        else:
                            distances.pop(1)
                            counter -= 1
            print(counter)
            self.sorting.insert(0, [w, counter, distances[0]])
        return self.sorting
    # not done

    def find_ME(self):
        counter2 = 0
        for r in self.centroids:
            for t in self.sorting:
                nums = []
                if t[1] == counter2:
                    nums.append(t[2])
                    population = len(nums)
                    error = sum(nums) / population

    def reassignCentroids(self):
        counter3 = 0
        for r in self.centroids:

            positionsX = []
            positionsY = []
            for t in self.sorting:
                if t[1] == counter3:
                    positionsX.append(t[0][0])
                    positionsY.append(t[0][1])
            population = len(positionsY)
            print(population)
            print(self.sorting)
            r[0] = sum(positionsX) / population
            r[1] = sum(positionsY) / population
            counter3  = 1
        return self.centroids

    def checkSimilar(self, prevList):
        list1 = []
        list2 = []
        for u in prevList:
            list1.append(u[1])
        for i in self.sorting:
            list2.append(i[1])
            print(i)
        if list2 == list1:

            return True
        else:
            return False


k = 3
data_set = [(1, 1), (1, 2), (1, 3), (2, 3), (50, 52), (48, 50), (47, 60), (112, 90), (120, 100), (108, 130), (102, 121), (43, 51), (0, 1)]
attempt = Kmeans(k, data_set, [], [])

attempt.initializeCentroids()

xvals = []
yvals = []
sortCompare = []
maxIterations = 100000
# plots

for p in data_set:
    xvals.append(p[0])
    yvals.append(p[1])


running = True
zeroError = True

while running:
    attempt.calcDistance()
    sortCompare = attempt.sorting
    print(sortCompare, "thisss")
    attempt.reassignCentroids()
    attempt.calcDistance()
    attempt.reassignCentroids()
    boolVal = attempt.checkSimilar(sortCompare)
    if boolVal or maxIterations <= 0:
        xs = []
        ys = []
        for y in attempt.centroids:
            xs.append(y[0])
            ys.append(y[1])
        plt.scatter(xs, ys)
        running = False
    else:
        sortCompare = []

    maxIterations -= 1
    print(attempt.sorting)
print(attempt.centroids)
plt.scatter(xvals, yvals)
plt.show()

Full error: Traceback (most recent call last): File "C:/Users/Jack Cramer/PycharmProjects/kmeans/main.py", line 117, in attempt.reassignCentroids() File "C:/Users/Jack Cramer/PycharmProjects/kmeans/main.py", line 73, in reassignCentroids r[0] = sum(positionsX) / population ZeroDivisionError: division by zero

If you know why this is happening please let me know, thanks for any advice.

CodePudding user response：

As @thierry-lathuille pointed out, the error occurs in reassignCentroids when you divide by population which is zero. population is set to be the length of positionsY, so we need to see which scenarios cause positionsY to have no elements.

positionsY has its values appended inside a loop over the values t in self.sorting. A value is only appended if counter3 (which ranges from 0 to K-1) matches t[1]. So if there is a value of counter3 for which none of the t[1] values are equal, we'll get the error. To help debug, I've added some print statements to the loop

            print(f"{counter3=}")
            for t in self.sorting:
                print(f"{t[1]=}")
                if t[1] == counter3:
                    positionsX.append(t[0][0])
                    positionsY.append(t[0][1])

Running this several times with k=2, I see that t[1] is either 0 or 1, which won't crash, as you see. However, going up to k=3, sometimes I get t[1] equal to 0, 1, or 2, but other times it only equals 0 or 1. This means there will be no matches, and you'll divide by zero. Going up to k=4, we should get t[1] equal to 0, 1, 2, or 3, but I'm not seeing any 3's after many restarts.

From context, I'm guessing that t[1] represents the cluster that a given data point is closest to. From your inputs, it appears by eye that there are three clusters of points So your code naturally has three of the predicted clusters trend towards the places they should be, so if you're trying to predict 4 clusters, you'll always crash. If you're only trying to predict 3 clusters, you might get unlucky and one of your predicted clusters isn't the closest to any point. It wouldn't surprise me if even k=2 errored out for certain initializations, but it must be rare in practice.

I think you need to first ask yourself what behavior you want to see when k is higher than the real number of clusters. And second you need to determine how to update a cluster's location when none of the data points are closest to it. A quick workaround would be to do nothing if population == 0, though that may not give the behavior you want.