I have the following dataframe:
import pandas as pd

d_test = {
    'random_staff': ['gfda', 'fsd', 'gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2],
}
df_test = pd.DataFrame(d_test)
The cluster_number column contains values from 1 to n. Values may repeat, but no value in that range is missing. In the example above, the values are 1, 2, 3, 4.
I want to select a value from the cluster_number column and replace its repeated occurrences with new unique values, so that no value in the range is missing afterwards. For example, if we select the value 2, the desired result for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]. Note that the column contained three 2s: the first one is kept as 2, the next occurrence becomes 5, and the last occurrence becomes 6.
I wrote code for the logic above and it works fine:
cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change  # the first occurrence keeps its value
for index, row in df_test.iterrows():
    if row['cluster_number'] == cluster_number_to_change:
        df_test.loc[index, 'cluster_number'] = i
        if first_iter:
            # later occurrences are numbered starting from max + 1
            i = max_cluster + 1
            first_iter = False
        else:
            i += 1
But it is written as a for loop, and I am trying to understand whether it can be rewritten using the pandas .apply method (or any other efficient vectorized solution).
CodePudding user response:
Using boolean indexing:
# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()
# increment duplicates using the max as reference
df_test.loc[m1 & m2, 'cluster_number'] = (
    m2.where(m1).cumsum()
      .add(df_test['cluster_number'].max())
      .convert_dtypes()
)
print(df_test)
Output:
  random_staff  cluster_number
0         gfda               1
1          fsd               2
2          gec               3
3          erw               3
4           gd               5
5         kjhk               1
6           fd               4
7          kui               6
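
If this needs to be reused for different cluster values, the same idea can be wrapped in a small helper. The following is only a sketch, not part of the answer above: the function name split_cluster and its parameters are made up for illustration, and it uses positional indexing with NumPy instead of the duplicated/cumsum masks. It assumes the column holds plain integers and returns a new Series rather than modifying the original.

```python
import numpy as np
import pandas as pd


def split_cluster(clusters: pd.Series, target: int) -> pd.Series:
    """Keep the first occurrence of `target`; renumber later ones from max + 1.

    Hypothetical helper for illustration only.
    """
    out = clusters.copy()
    # Positions (not index labels) of every row equal to the target value.
    positions = np.flatnonzero((out == target).to_numpy())
    # Skip the first occurrence so that it keeps its original number.
    later = positions[1:]
    # New labels continue after the current maximum: max + 1, max + 2, ...
    out.iloc[later] = np.arange(len(later)) + out.max() + 1
    return out


df_test = pd.DataFrame({
    'random_staff': ['gfda', 'fsd', 'gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2],
})
df_test['cluster_number'] = split_cluster(df_test['cluster_number'], 2)
print(df_test['cluster_number'].tolist())  # [1, 2, 3, 3, 5, 1, 4, 6]
```

Because the helper works on positions, it should not depend on the index being unique or sorted.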