I have a pandas DataFrame (df) with three columns (user, values, and group). Each row of the values column holds a list of numbers.
import numpy as np
import pandas as pd
from scipy.spatial import distance
df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'],
                   'values': [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
                   'group': ['B', 'A', 'C', 'A', 'B', 'B']})
df
Output:
     user        values group
0  user_1  [1, 0, 2, 0]     B
1  user_2  [1, 8, 0, 2]     A
2  user_3  [6, 2, 0, 0]     C
3  user_4  [5, 0, 2, 2]     A
4  user_5  [3, 8, 0, 0]     B
5  user_6  [6, 0, 0, 2]     B
Then I calculate the element-wise mean of the values in each group (the group's centroid) and store the result in a second DataFrame (df1).
df1 = (df.groupby('group', as_index=False)['values']
         .agg(lambda x: np.vstack(x).mean(0).round(2))
       )
df1
Output:
  group                    values
0     A      [3.0, 4.0, 1.0, 2.0]
1     B  [3.33, 2.67, 0.67, 0.67]
2     C      [6.0, 2.0, 0.0, 0.0]
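As a quick check of the centroid step, here is a minimal sketch that recomputes the centroid of group A by hand from its two rows (user_2 and user_4). The variable names group_a and centroid_a are only illustrative.

```python
import numpy as np

# The two rows that belong to group A in the example (user_2 and user_4).
group_a = np.vstack([[1, 8, 0, 2], [5, 0, 2, 2]])

# The column-wise mean is the centroid of the group.
centroid_a = group_a.mean(axis=0).round(2)
print(centroid_a)  # [3. 4. 1. 2.]
```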
Finally, I compute the Euclidean distance from each user to the centroid of every cluster with the following code.
for value in df['values']:
    distance_values = []
    for centroid in df1['values']:
        distance_values.append(distance.euclidean(value, centroid))
    print(distance_values)
Output:
[5.0, 3.8439042651970405, 5.744562646538029]
[4.58257569495584, 6.004631545732011, 8.06225774829855]
[4.242640687119285, 2.9112883745860696, 0.0]
[4.58257569495584, 3.668187563361503, 3.605551275463989]
[4.58257569495584, 5.4236150305861495, 6.708203932499369]
[5.0990195135927845, 4.059014658756482, 2.8284271247461903]
So, for each user, I have the distance to the centroid of each cluster.
For example, for user_1 the distances are A = 5.0, B = 3.8439042651970405, and C = 5.744562646538029.
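To make the first of these numbers concrete, the following sketch works out the distance from user_1 to centroid A step by step with plain NumPy. It is only a sanity check on the arithmetic, not a replacement for distance.euclidean.

```python
import numpy as np

user_1 = np.array([1, 0, 2, 0])
centroid_a = np.array([3.0, 4.0, 1.0, 2.0])

diff = user_1 - centroid_a          # [-2, -4, 1, -2]
dist = np.sqrt((diff ** 2).sum())   # sqrt(4 + 16 + 1 + 4) = sqrt(25)
print(dist)  # 5.0
```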
How do I get, for each user, the maximum of these distances together with the name of the corresponding cluster, as a DataFrame?
For example, the expected output is:
     user           max_value group
0  user_1   5.744562646538029     C
1  user_2    8.06225774829855     C
2  user_3   4.242640687119285     A
3  user_4    4.58257569495584     A
4  user_5   6.708203932499369     C
5  user_6  5.0990195135927845     A
CodePudding user response:
You can use apply to extract each row's maximum together with its position in the list, then use the .str accessor to split that pair into two columns and map the position to a group name:
df['distance_values'] = [[5.0, 3.8439042651970405, 5.744562646538029],
[4.58257569495584, 6.004631545732011, 8.06225774829855],
[4.242640687119285, 2.9112883745860696, 0.0],
[4.58257569495584, 3.668187563361503, 3.605551275463989],
[4.58257569495584, 5.4236150305861495, 6.708203932499369],
[5.0990195135927845, 4.059014658756482, 2.8284271247461903]]
max_df = df['distance_values'].apply(lambda x: [max(x), x.index(max(x))])
df['max_value'] = max_df.str[0]
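# Positions 0, 1, 2 correspond to centroids 'A', 'B', 'C'; note that this overwrites the original 'group' column.
df['group'] = max_df.str[1].map(dict(enumerate('ABC')))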
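A variation on the same idea, sketched under the assumption that the distance lists follow the centroid order A, B, C from df1: put the distances into a wide DataFrame with one column per cluster, then let max and idxmax pick the value and the column label. The names users, dist_df, and result are only illustrative.

```python
import pandas as pd

users = ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6']
distance_values = [[5.0, 3.8439042651970405, 5.744562646538029],
                   [4.58257569495584, 6.004631545732011, 8.06225774829855],
                   [4.242640687119285, 2.9112883745860696, 0.0],
                   [4.58257569495584, 3.668187563361503, 3.605551275463989],
                   [4.58257569495584, 5.4236150305861495, 6.708203932499369],
                   [5.0990195135927845, 4.059014658756482, 2.8284271247461903]]

# Assumption: the columns follow the centroid order in df1, i.e. A, B, C.
dist_df = pd.DataFrame(distance_values, index=users, columns=['A', 'B', 'C'])

result = pd.DataFrame({
    'user': dist_df.index,
    'max_value': dist_df.max(axis=1).to_numpy(),   # largest distance per row
    'group': dist_df.idxmax(axis=1).to_numpy(),    # column label of that distance
})
print(result)
```

If two clusters are tied for the largest distance, idxmax reports only the first of them.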
CodePudding user response:
max_dist_idx = []      # largest distance for each user
distant_cluster = []   # position of the farthest centroid for each user
for value in df['values']:
    distance_values = []
    for centroid in df1['values']:
        distance_values.append(distance.euclidean(value, centroid))
    max_dist_idx.append(max(distance_values))
    distant_cluster.append(distance_values.index(max(distance_values)))
cluster_map = {0: 'A', 1: 'B', 2: 'C'}
max_group = [cluster_map[i] for i in distant_cluster]
Then you can build the result DataFrame:
pd.DataFrame(data={'user': df.user,
                   'max_value': max_dist_idx,
                   'group': max_group})
     user  max_value group
0  user_1   5.744563     C
1  user_2   8.062258     C
2  user_3   4.242641     A
3  user_4   4.582576     A
4  user_5   6.708204     C
5  user_6   5.099020     A
CodePudding user response:
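You can also move the Euclidean distance calculation into the function you apply, so the distances and their maximum are computed in one step. Ties are kept, because every group at the maximum distance is returned: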
def calc_max_dist(value):
    # Distance from this user's values to every centroid in df1.
    dist_series = df1['values'].apply(lambda x: distance.euclidean(value, x))
    # Return the largest distance and the group(s) that lie at that distance.
    return dist_series.max(), df1[dist_series == dist_series.max()]['group'].values
df[['max_value', 'farthest_group(s)']] = pd.DataFrame(df['values'].apply(calc_max_dist).tolist())
Output:
     user        values group  max_value farthest_group(s)
0  user_1  [1, 0, 2, 0]     B   5.744563               [C]
1  user_2  [1, 8, 0, 2]     A   8.062258               [C]
2  user_3  [6, 2, 0, 0]     C   4.242641               [A]
3  user_4  [5, 0, 2, 2]     A   4.582576               [A]
4  user_5  [3, 8, 0, 0]     B   6.708204               [B]
5  user_6  [6, 0, 0, 2]     B   5.099020               [A]
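For larger inputs, the two nested loops can be replaced by a single call to scipy.spatial.distance.cdist, which computes all user-to-centroid distances at once. The sketch below is self-contained and rebuilds the centroids in plain NumPy (rounded to two decimals, as in the question) so that it does not depend on how groupby handles list-valued columns. The names points, centroids, and result are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'],
                   'values': [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0],
                              [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
                   'group': ['B', 'A', 'C', 'A', 'B', 'B']})

points = np.array(df['values'].tolist())        # shape (6, 4)
groups = np.array(sorted(df['group'].unique())) # ['A', 'B', 'C']

# One centroid per group, rounded to 2 decimals to match the question.
centroids = np.vstack([points[(df['group'] == g).to_numpy()].mean(axis=0).round(2)
                       for g in groups])        # shape (3, 4)

dists = cdist(points, centroids)                # Euclidean by default, shape (6, 3)
farthest = dists.argmax(axis=1)                 # position of the farthest centroid

result = pd.DataFrame({'user': df['user'],
                       'max_value': dists.max(axis=1),
                       'group': groups[farthest]})
print(result)
```

Like list.index, argmax returns only the first position when two centroids are equally far away.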