Find the cluster number corresponding to cluster labels

Time:12-07

Based on the result of a clustering I have the following data:

    timestamp  consumption  cluster_number
0           0    35.666667               1
1           1    29.352222               1
2           2    24.430000               1
3           3    21.756667               1
4           4    20.345556               1
5           5    19.763333               1
6           6    19.874444               1
7           7    22.078889               1
8           8    28.608889               1
9           9    33.827778               2
10         10    36.414444               2
11         11    38.340000               2
12         12    43.305556               2
13         13    43.034444               2
14         14    39.076667               2
15         15    36.378889               2
16         16    36.171111               2
17         17    40.381111               2
18         18    48.692222               0
19         19    52.330000               0
20         20    50.154444               0
21         21    46.491111               0
22         22    44.014444               0
23         23    40.628889               0

With this clustering, the maximum of the consumption column (and the values close to it) falls in cluster_number 0, the minimum (and the values close to it) falls in cluster_number 1, and the remaining values fall in cluster_number 2. However, I cannot know beforehand which cluster_number will hold which range of consumption values. So I first need a way to map each cluster_number to a class (high, low or middle), and then build a list of the timestamp values for each class.

Specifically, I want to come up with three lists:

  1. high = [18, 19, 20, 21, 22, 23]
  2. low = [0, 1, 2, 3, 4, 5, 6, 7, 8]
  3. middle = [9, 10, 11, 12, 13, 14, 15, 16, 17]

Any idea how I can achieve this?

CodePudding user response:

You can rank the per-cluster minimum (groupby.min followed by rank) to find the order of the clusters, then collect the index values of each cluster with groupby.agg and rename the clusters according to that order:

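# labels in ascending order of consumption
order = ['low', 'middle', 'high']

# reset_index exposes the row index as an 'index' column
g = df.reset_index().groupby('cluster_number')
# rank the clusters by a per-cluster statistic
mapper = (g['consumption'].min() # min for the demo, you can use any function mean/sum/…
          .rank(method='dense')  # 1 = lowest cluster, 3 = highest
          .map(dict(enumerate(order, start=1)))  # rank -> label
         )
# collect the row indices of each cluster, then relabel the clusters
out = g['index'].agg(list).rename(mapper)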

Output:

cluster_number
high                 [18, 19, 20, 21, 22, 23]
low               [0, 1, 2, 3, 4, 5, 6, 7, 8]
middle    [9, 10, 11, 12, 13, 14, 15, 16, 17]
Name: index, dtype: object
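
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the table from the question, ranks the clusters by their mean consumption instead of the minimum, and collects the timestamp column directly rather than the reset index (in this data the two are identical). The names `ranks`, `labels` and `lists` are only illustrative, and the sketch assumes the per-cluster means are distinct; with ties, dense ranking would give two clusters the same label.

```python
import pandas as pd

# Example data copied from the question (24 rows, 3 clusters).
consumption = [
    35.666667, 29.352222, 24.430000, 21.756667, 20.345556, 19.763333,
    19.874444, 22.078889, 28.608889, 33.827778, 36.414444, 38.340000,
    43.305556, 43.034444, 39.076667, 36.378889, 36.171111, 40.381111,
    48.692222, 52.330000, 50.154444, 46.491111, 44.014444, 40.628889,
]
df = pd.DataFrame({
    "timestamp": list(range(24)),
    "consumption": consumption,
    "cluster_number": [1] * 9 + [2] * 9 + [0] * 6,
})

# Labels in ascending order of consumption.
order = ["low", "middle", "high"]

g = df.groupby("cluster_number")

# Rank the clusters by mean consumption: 1 = lowest, 3 = highest.
ranks = g["consumption"].mean().rank(method="dense").astype(int)

# Translate each rank into its label; the result is indexed by cluster_number.
labels = ranks.map(dict(enumerate(order, start=1)))

# Collect the timestamps of each cluster, keyed by the cluster's label.
lists = {labels.loc[c]: ts for c, ts in g["timestamp"].agg(list).items()}

high, low, middle = lists["high"], lists["low"], lists["middle"]
print(high)
print(low)
print(middle)
```

Because the labels are looked up from the ranking rather than hard-coded, the same code should keep working if a later clustering run assigns the cluster numbers differently.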