I'm looking for an algorithm that computes a separate mean for each group of similar values. Example: I have the values 1.6, 1.7, 5.6, 5.7, 5.5. So the output should be 1.65 and 5.6.
CodePudding user response:
If you know the "range" around each cluster mean
A possible simple solution: round every value to a multiple of your "range" parameter; group values that are rounded to the same multiple.
To group, you can use a combination of sorted and itertools.groupby, or more simply, you can use a dict of lists.
from collections import defaultdict

def clusters(data, r):
    groups = defaultdict(list)
    for x in data:
        groups[x // r].append(x)
    return groups

def means_of_clusters(data, r):
    return [sum(g) / len(g) for g in clusters(data, r).values()]

print(means_of_clusters([1.6, 1.7, 5.6, 5.7, 5.5], 0.4))
# [1.65, 5.55, 5.7]
Note how 5.7 was separated from 5.5 and 5.6, because 5.5 and 5.6 were rounded to 13*0.4, whereas 5.7 was rounded to 14*0.4.
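For comparison, here is a minimal sketch of the sorted plus itertools.groupby variant mentioned above. The function names clusters_groupby and means_of_clusters_groupby are made up for this illustration; the bucketing rule (x // r) is the same as in the dict-of-lists version, so it shares the same boundary effect.

```python
from itertools import groupby

def clusters_groupby(data, r):
    # Bucket key: index of the multiple of r that x falls into (floor division).
    def bucket(x):
        return x // r
    # groupby only merges adjacent items, so sort by the same key first.
    ordered = sorted(data, key=bucket)
    return [list(group) for _, group in groupby(ordered, key=bucket)]

def means_of_clusters_groupby(data, r):
    return [sum(g) / len(g) for g in clusters_groupby(data, r)]

# With a coarser range of 1.0, the three values near 5.6 land in one bucket.
print(means_of_clusters_groupby([1.6, 1.7, 5.6, 5.7, 5.5], 1.0))
```

Unlike the dict version, this returns the clusters ordered by bucket, at the cost of an extra sort.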
If you know the number of clusters
You mentioned in the comments that there will always be 2 clusters. I suggest just looking for the greatest gap between two consecutive numbers in the sorted list, and splitting on that gap:
def split_in_2_clusters(data):
    seq = sorted(data)
    split_index = max(range(1, len(seq)), key=lambda i: seq[i] - seq[i-1])
    return seq[:split_index], seq[split_index:]

def means_of_2_clusters(data):
    return tuple(sum(g) / len(g) for g in split_in_2_clusters(data))

print(means_of_2_clusters([1.6, 1.7, 5.6, 5.7, 5.5]))
# (1.65, 5.6000000000000005)
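If the number of clusters is known but is not 2, the same idea can probably be extended by cutting at the k-1 largest gaps. This is only a sketch of that extension, not part of the answer above; split_in_k_clusters and its parameter k are hypothetical names, and the approach assumes one-dimensional data with clearly separated groups.

```python
def split_in_k_clusters(data, k):
    seq = sorted(data)
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and the number of values")
    # Candidate cut positions, ranked by the size of the gap just before them.
    by_gap = sorted(range(1, len(seq)),
                    key=lambda i: seq[i] - seq[i - 1],
                    reverse=True)
    # Keep the k-1 largest gaps and put the cut positions back in order.
    cuts = sorted(by_gap[:k - 1])
    bounds = [0, *cuts, len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

def means_of_k_clusters(data, k):
    return [sum(g) / len(g) for g in split_in_k_clusters(data, k)]

print(split_in_k_clusters([1, 2, 10, 11, 50], 3))
# [[1, 2], [10, 11], [50]]
```

With k=2 this reduces to the single-gap split shown earlier.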
For more complex clustering problems
I strongly suggest taking a look at the clustering algorithms implemented in the scikit-learn library. The documentation page lists the algorithms in a table that explains which parameters each algorithm expects, so you can easily choose the one best suited to your situation.
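As one possible starting point, here is a minimal sketch using scikit-learn's KMeans on the example data. It assumes scikit-learn and NumPy are installed and that the number of clusters (2) is known; KMeans is just one choice among the library's algorithms, and the n_init and random_state values are arbitrary picks for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

data = [1.6, 1.7, 5.6, 5.7, 5.5]

# scikit-learn expects a 2D array of shape (n_samples, n_features).
X = np.array(data).reshape(-1, 1)

# Assumption: exactly 2 clusters; random_state fixed for reproducibility.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For k-means, each cluster center is the mean of the points assigned to it.
centers = sorted(model.cluster_centers_.ravel().tolist())
print(centers)  # roughly 1.65 and 5.6 for this data
```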