I am working with a DataFrame like this:
df = pd.DataFrame({'ID': ['12345', '55689', '56964', '49649', '89645', '0001',
                          '033', '03330', '064963', '306193', '03661', '1666'],
                   'Culture': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'H': [30, 42, 25, 32, 12, 10, 4, 6, 5, 10, 24, 21],
                   'S': [10, 76, 100, 23, 65, 94, 67, 24, 67, 54, 87, 81],
                   'mean': [23, 78, 95, 52, 60, 76, 68, 92, 34, 76, 34, 12]})
First I selected just one group with df_1 = df.loc[df['Culture'] == 'A'] and ran k-means on it like this:
m=df_1.loc[:,['H','mean','mean']].to_numpy()
km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=1).fit(m)
kmeans_predict = km.predict(m)
array([0, 2, 1, 1, 0, 0], dtype=int32)
clusters = {}
n = 0
for item in kmeans_predict:
    if item in clusters:
        clusters[item].append(list_x1[n])
    else:
        clusters[item] = [list_x1[n]]
    n += 1
And I got something like this after more code:
ID Culture S mean Cluster
12345 A 10 23 0
55689 A 76 78 2
56964 A 100 95 1
49649 A 23 52 1
89645 A 65 60 0
00001 A 94 92 0
My goal is to run k-means on every group in this DataFrame, but I do not want to do all this one group (Culture) at a time, because there are more than 75 groups. I tried something like:
def cluster(X):
    k_means = KMeans(n_clusters=3).fit(m).groupby('CUL')
    X['cluster'] = k_means.labels_
    return X

df= cities_e.groupby('CUL').apply(cluster)
I am trying to run this clustering inside each 'Culture' group and get each row's predicted cluster in the DataFrame.
Answer:
You could simply wrap your code in a function and use groupby.apply. However, to keep the row indexes, return a Series instead of an array:
import pandas as pd
from sklearn.cluster import KMeans

def get_cluster(df_1):
    m = df_1.loc[:, ['H', 'mean', 'mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    kmeans_predict = km.predict(m)
    # Keep the group's index so the labels align with the original rows
    return pd.Series(kmeans_predict, index=df_1.index)

df['Cluster'] = df.groupby('Culture').apply(get_cluster).droplevel(0)
Output:
ID Culture H S mean Cluster
0 12345 A 30 10 23 2
1 55689 A 42 76 78 0
2 56964 A 25 100 95 1
3 49649 A 32 23 52 2
4 89645 A 12 65 60 2
5 0001 A 10 94 76 1
6 033 B 4 67 68 1
7 03330 B 6 24 92 0
8 064963 B 5 67 34 2
9 306193 B 10 54 76 0
10 03661 B 24 87 34 2
11 1666 B 21 81 12 2
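Because the code above uses init='random' with n_init=1, the labels can differ from run to run, and KMeans raises an error for any group that has fewer rows than clusters. The following is only a sketch of one way to handle both, assuming the features are 'H', 'S' and 'mean' (the original selects ['H','mean','mean']); the names FEATURES, N_CLUSTERS and cluster_labels are made up for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    'ID': ['12345', '55689', '56964', '49649', '89645', '0001',
           '033', '03330', '064963', '306193', '03661', '1666'],
    'Culture': ['A'] * 6 + ['B'] * 6,
    'H': [30, 42, 25, 32, 12, 10, 4, 6, 5, 10, 24, 21],
    'S': [10, 76, 100, 23, 65, 94, 67, 24, 67, 54, 87, 81],
    'mean': [23, 78, 95, 52, 60, 76, 68, 92, 34, 76, 34, 12],
})

FEATURES = ['H', 'S', 'mean']  # assumed feature columns; adjust as needed
N_CLUSTERS = 3

def cluster_labels(group, features=FEATURES, k=N_CLUSTERS, seed=0):
    """Return one KMeans label per row of `group`, aligned to its index."""
    # KMeans needs at least as many samples as clusters, so cap k.
    k = min(k, len(group))
    # A fixed random_state makes the labels repeatable between runs.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(group[features].to_numpy())
    return pd.Series(labels, index=group.index)

# Explicit loop over the groups; the concatenated Series keeps the
# original row index, so the assignment aligns row by row.
parts = [cluster_labels(g) for _, g in df.groupby('Culture')]
df['Cluster'] = pd.concat(parts)
```

The explicit loop does the same job as groupby.apply followed by droplevel(0); it is just a little easier to step through when one of the 75+ groups misbehaves.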
If you want distinct cluster numbers across different Cultures, you can assign a group number to each Culture and then use it to offset the cluster numbers:
def get_cluster(df_1):
    m = df_1.loc[:, ['H', 'mean', 'mean']].to_numpy()
    km = KMeans(n_clusters=3, init='random', max_iter=100, n_init=1, verbose=0).fit(m)
    # Shift labels by 3 per Culture so each group gets its own range (0-2, 3-5, ...)
    kmeans_predict = km.predict(m) + 3 * df_1['Culture_id'].iat[0]
    return pd.Series(kmeans_predict, index=df_1.index)

g = df.groupby('Culture')
df['Culture_id'] = g.ngroup()
df['Cluster'] = g.apply(get_cluster).droplevel(0)
df = df.drop(columns=['Culture_id'])
Output:
ID Culture H S mean Cluster
0 12345 A 30 10 23 0
1 55689 A 42 76 78 1
2 56964 A 25 100 95 1
3 49649 A 32 23 52 0
4 89645 A 12 65 60 2
5 0001 A 10 94 76 2
6 033 B 4 67 68 3
7 03330 B 6 24 92 5
8 064963 B 5 67 34 4
9 306193 B 10 54 76 3
10 03661 B 24 87 34 4
11 1666 B 21 81 12 4
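The offset can also be applied after clustering, without a temporary Culture_id column. This is a small sketch on a toy frame whose per-group labels are invented for illustration; the column name GlobalCluster is hypothetical:

```python
import pandas as pd

# Toy frame standing in for the result of the per-group clustering step.
df = pd.DataFrame({
    'Culture': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Cluster': [0, 2, 1, 1, 0, 2],  # labels are local to each Culture
})

N_CLUSTERS = 3  # must match the n_clusters passed to KMeans

# ngroup() numbers the groups 0, 1, 2, ... in sorted key order,
# so each Culture gets its own block of N_CLUSTERS label values.
offset = N_CLUSTERS * df.groupby('Culture').ngroup()
df['GlobalCluster'] = df['Cluster'] + offset

print(df['GlobalCluster'].tolist())  # [0, 2, 1, 4, 3, 5]
```

Keeping the local label and the global label in separate columns makes it easy to recover the within-group cluster later (GlobalCluster % N_CLUSTERS).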