Based on the result of a clustering I have the following data:
timestamp consumption cluster_number
0 0 35.666667 1
1 1 29.352222 1
2 2 24.430000 1
3 3 21.756667 1
4 4 20.345556 1
5 5 19.763333 1
6 6 19.874444 1
7 7 22.078889 1
8 8 28.608889 1
9 9 33.827778 2
10 10 36.414444 2
11 11 38.340000 2
12 12 43.305556 2
13 13 43.034444 2
14 14 39.076667 2
15 15 36.378889 2
16 16 36.171111 2
17 17 40.381111 2
18 18 48.692222 0
19 19 52.330000 0
20 20 50.154444 0
21 21 46.491111 0
22 22 44.014444 0
23 23 40.628889 0
With this clustering, the maximum value (and values close to the maximum value) of the column consumption
is in cluster_number
0, the minimum value (and values close to the minimum value) of the column consumption
is in cluster_number
1 and the rest in cluster_number
2. However, I cannot know beforehand which 'consumption' values correspond to which cluster_number
, so I need to find a way to first connect the cluster_number
with high, low and middle class and then come up with a list of the column timestamp
for each cluster.
Specifically, I want to come up with three lists:
- high = [18, 19, 20, 21, 22, 23]
- low = [0, 1, 2, 3, 4, 5, 6, 7, 8]
- middle = [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
Any idea of how can I achieve this?
CodePudding user response:
You can use a groupby.min
rank
to identify the order of the clusters, then groupby.agg
on the index and rename using the cluster order:
order = ['low', 'middle', 'high']
g = df.reset_index().groupby('cluster_number')
mapper = (g['consumption'].min() # min for the demo, you can use any function mean/sum/…
.rank(method='dense')
.map(dict(enumerate(order, start=1)))
)
out = g['index'].agg(list).rename(mapper)
Output:
cluster_number
high [18, 19, 20, 21, 22, 23]
low [0, 1, 2, 3, 4, 5, 6, 7, 8]
middle [9, 10, 11, 12, 13, 14, 15, 16, 17]
Name: index, dtype: object