I am a beginner in Python and have this DataFrame, which contains samples, values, and a cluster number for each sample:
import pandas as pd

df = pd.DataFrame({'samples': ['A', 'B', 'C', 'D', 'E'],
                   'values': [0.336663, 0.447101, 0.402529, 0.373014, 0.456226],
                   'cluster': [1, 0, 2, 0, 1]})
df
output:
  samples    values  cluster
0       A  0.336663        1
1       B  0.447101        0
2       C  0.402529        2
3       D  0.373014        0
4       E  0.456226        1
The following code returns the index of the sample with the max value in each cluster. For example, cluster 0 contains B and D, and B has the larger value, so the code returns B's index, which is 1. Cluster 1 contains A and E, and E has the larger value, so E's index, 4, is returned, and so on.
value = []      # not used below
max_value = []  # list to store the max value of each cluster
clust_max = []  # list to store the index of each cluster's max-value sample
# loop over the cluster labels
tmp = df['values']
clust_labels = df['cluster']
clusters = len(set(clust_labels))  # number of clusters; assumes labels are 0..n-1
for j in range(clusters):
    elems = [i for i, x in enumerate(clust_labels) if x == j]  # positions of the samples in cluster j
    values = [tmp[elem] for elem in elems]  # values of those samples
    max_value_temp = max(values)  # max value in the cluster
    max_value.append(max_value_temp)  # store the max value
    max_ind = values.index(max_value_temp)  # position of the max within the cluster
    clust_max.append(elems[max_ind])  # store the index of the max-value sample
print(clust_max)
output:
[1, 4, 2]
I want to update this code to return all sample indexes, not only the index of the max-value sample in each cluster.
The expected output:
[0, 1, 2, 3, 4]
CodePudding user response:
I don't really get why you are using Java-style logic to work with Python; probably, as you mentioned, you are still new to it. I didn't quite get what you expect from the output, so I did something according to what I understood.
import pandas as pd

dfc = pd.DataFrame({'samples': ['A', 'B', 'C', 'D', 'E'],
                    'values': [0.336663, 0.447101, 0.402529, 0.373014, 0.456226],
                    'cluster': [1, 0, 2, 0, 1]})
# get the max value of each cluster using groupby
dfmax = dfc.groupby('cluster')[['values']].max()
# add the row index of each cluster's max value as a column, using groupby and idxmax
# (idxmax is applied to 'values' only; the string column 'samples' has no meaningful idxmax)
dfmax['idx'] = dfc.groupby('cluster')['values'].idxmax()
# you can also sort by two columns, here values and then cluster (or the other way round if you prefer),
# which gives a grouping-like view without an explicit loop
dfsorted = dfc.sort_values(['values', 'cluster'], ascending=False)
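Since the expected output is not entirely clear to me, here is a minimal sketch under two possible readings of "all sample indexes": either every index grouped by cluster (ordered by descending value inside each cluster), or simply every row index as one flat list. The names max_idx, idx_by_cluster and all_idx are made up for this illustration, and the grouping reading is my assumption, not something stated in the question.

```python
import pandas as pd

df = pd.DataFrame({'samples': ['A', 'B', 'C', 'D', 'E'],
                   'values': [0.336663, 0.447101, 0.402529, 0.373014, 0.456226],
                   'cluster': [1, 0, 2, 0, 1]})

# Index of the max-value sample in each cluster (same result as the loop).
max_idx = df.groupby('cluster')['values'].idxmax().tolist()

# Reading 1 (assumption): all sample indexes, grouped by cluster and
# ordered by descending value within each cluster.
ordered = df.sort_values(['cluster', 'values'], ascending=[True, False])
idx_by_cluster = {int(c): g.index.tolist() for c, g in ordered.groupby('cluster')}

# Reading 2 (assumption): every sample index as one flat list, in row order.
all_idx = df.index.tolist()

print(max_idx)         # [1, 4, 2]
print(idx_by_cluster)  # {0: [1, 3], 1: [4, 0], 2: [2]}
print(all_idx)         # [0, 1, 2, 3, 4]
```

The second reading matches the expected output in the question exactly; the first one may be closer to the intent if the cluster membership still matters.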