Group data by a given Range

Time:12-29

I'm looking for an algorithm that computes a separate mean for each group of nearby values. Example: I have the values 1.6, 1.7, 5.6, 5.7, 5.5, so the output should be 1.65 and 5.6.

CodePudding user response:

If you know the "range" around each cluster mean

A possible simple solution: round every value down to a multiple of your "range" parameter, then group the values that land on the same multiple.

To group, you can use a combination of sorted and itertools.groupby, or more simply, you can use a dict of lists.

from collections import defaultdict

def clusters(data, r):
    # Bucket each value by how many whole multiples of r fit below it.
    groups = defaultdict(list)
    for x in data:
        groups[x // r].append(x)
    return groups

def means_of_clusters(data, r):
    # One mean per bucket, in order of first appearance in data.
    return [sum(g) / len(g) for g in clusters(data, r).values()]

print(means_of_clusters([1.6, 1.7, 5.6, 5.7, 5.5], 0.4))
# [1.65, 5.55, 5.7]

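Note how 5.7 was separated from 5.5 and 5.6: floor division put 5.5 and 5.6 in bucket 13 (starting at 13*0.4 = 5.2), whereas 5.7 fell in bucket 14 (starting at 14*0.4 = 5.6). The bucket boundaries are fixed multiples of the range, so a cluster that straddles a boundary gets split. Here 5.6 sits exactly on a boundary and lands in bucket 13 only because of floating-point rounding.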
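The sorted plus itertools.groupby alternative mentioned earlier could look roughly like the sketch below. The function name means_of_clusters_groupby is made up for illustration; the bucketing rule (x // r) is the same as in the dict version.

```python
from itertools import groupby

def means_of_clusters_groupby(data, r):
    # Same bucket key as the dict-of-lists version: floor division by r.
    def bucket(x):
        return x // r

    # groupby only merges adjacent items, so sort by bucket first.
    ordered = sorted(data, key=bucket)
    means = []
    for _, group in groupby(ordered, key=bucket):
        values = list(group)
        means.append(sum(values) / len(values))
    return means

result = means_of_clusters_groupby([1.6, 1.7, 5.6, 5.7, 5.5], 0.4)
print(result)
```

The result is the same three means as the dict version, ordered by bucket rather than by first appearance. The dict version avoids the sort, which is why it is the simpler choice.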

If you know the number of clusters

You mentioned in the comments that there will always be 2 clusters. I suggest just looking for the greatest gap between two consecutive numbers in the sorted list, and splitting on that gap:

def split_in_2_clusters(data):
    # Requires at least two values; otherwise max() has nothing to choose from.
    seq = sorted(data)
    # Index of the element that follows the largest gap between neighbours.
    split_index = max(range(1, len(seq)), key=lambda i: seq[i] - seq[i-1])
    return seq[:split_index], seq[split_index:]

def means_of_2_clusters(data):
    return tuple(sum(g) / len(g) for g in split_in_2_clusters(data))

print(means_of_2_clusters([1.6, 1.7, 5.6, 5.7, 5.5]))
# (1.65, 5.6000000000000005)
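If the number of clusters is known but is not 2, one possible generalization (not part of the answer above, just a sketch) is to cut the sorted list at its k-1 largest gaps. The helper name split_in_k_clusters and its parameter k are hypothetical.

```python
def split_in_k_clusters(data, k):
    """Split data into k groups by cutting at the k-1 largest gaps (hypothetical helper)."""
    seq = sorted(data)
    if not 1 <= k <= len(seq):
        raise ValueError("k must be between 1 and len(data)")
    # Candidate cut positions: a cut at i separates seq[i-1] from seq[i].
    by_gap = sorted(range(1, len(seq)),
                    key=lambda i: seq[i] - seq[i - 1],
                    reverse=True)
    # Keep the k-1 widest gaps, then restore left-to-right order.
    cuts = sorted(by_gap[:k - 1])
    bounds = [0, *cuts, len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

groups = split_in_k_clusters([1, 2, 10, 11, 30], 3)
print(groups)
# [[1, 2], [10, 11], [30]]
```

With k=2 this reduces to the split used in split_in_2_clusters. It assumes the gaps between clusters are wider than the gaps inside them; when that does not hold, a proper clustering algorithm is the safer choice.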

For more complex clustering problems

I strongly suggest taking a look at the clustering algorithms implemented in the scikit-learn library. Its documentation page lists the algorithms in a table that shows which parameters each algorithm expects, so you can easily choose the one best suited to your situation.
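As one illustration, here is a minimal sketch using scikit-learn's KMeans on the example data, assuming scikit-learn and NumPy are installed. KMeans is only one of the available algorithms and is picked here because the number of clusters is known; the parameter values (n_init=10, random_state=0) are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

data = [1.6, 1.7, 5.6, 5.7, 5.5]

# scikit-learn expects a 2-D array: one row per sample, one column per feature.
X = np.array(data).reshape(-1, 1)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# cluster_centers_ holds the mean of each cluster; sort for a stable order.
centers = sorted(model.cluster_centers_.ravel().tolist())
print(centers)
```

For this data the two centers come out at roughly 1.65 and 5.6, matching the gap-based split.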
