I have a dataframe "df" with multiple columns, in which each row is assigned to one of 150 clusters (the result of a clustering method). I extracted random rows from it to form a smaller dataframe "df-new". This new dataframe contains 9 clusters, repeated over more than 100 rows:
... cluster
0
0
4
95
...
155
98
95
The cluster numbers present, in order, are: 0,4,8,25,26,95,98,144,175
I would like to create a new column "new" that replaces each row's cluster number with its rank among the clusters present:
initial new
0 0
4 1
8 2
25 3
How can I iterate this for every row?
CodePudding user response:
First, get the clusters present in your selection:
clusters = df_new["cluster"].unique()
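Then sort them and build a dictionary whose keys are the cluster numbers and whose values are the rank of each cluster in your sub-selection. Note that unique() returns values in order of appearance, and numpy's argsort returns the indices that would sort the array rather than the ranks, so enumerating the sorted values is the safer way to build the mapping:
mapping = {cluster: rank for rank, cluster in enumerate(sorted(clusters))}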
Now you can create your new column from the "cluster" column of df_new (not df, which contains clusters that are missing from the mapping):
df_new["new"] = df_new["cluster"].map(mapping)
OUTPUT:
   cluster  new
0        0    0
1        0    0
2        4    1
3       95    2
4      155    4
5       98    3
6       95    2
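For reference, here is a minimal self-contained sketch of the approach above. The sample data is hypothetical and only mirrors the rows shown in the question. It also shows pd.factorize with sort=True as a possible one-step alternative; the column name "new_factorize" is just for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the rows shown in the question.
df_new = pd.DataFrame({"cluster": [0, 0, 4, 95, 155, 98, 95]})

# Explicit mapping: each distinct cluster -> its rank in sorted order.
clusters = df_new["cluster"].unique()
mapping = {cluster: rank for rank, cluster in enumerate(sorted(clusters))}
df_new["new"] = df_new["cluster"].map(mapping)

# Alternative: factorize with sort=True numbers the codes
# by the sorted order of the unique values.
codes, uniques = pd.factorize(df_new["cluster"], sort=True)
df_new["new_factorize"] = codes

print(df_new)
```

Both columns should contain the same values. The explicit dictionary is handy if you want to reuse the same mapping on another dataframe, while factorize avoids building it by hand.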