return the maximum value of each row with cluster name in dataframe

Time:11-16

I have a pandas dataframe (df) with three columns (user, values, and group). Each row of the values column holds a list of numbers.

import numpy as np
import pandas as pd
from scipy.spatial import distance

df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5',  'user_6'],
                   'values': [[1, 0, 2, 0], [1, 8, 0, 2],[6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0],[6, 0, 0, 2]],
                   'group': ['B', 'A', 'C', 'A', 'B', 'B']})
df

Output:

     user        values group
0  user_1  [1, 0, 2, 0]     B
1  user_2  [1, 8, 0, 2]     A
2  user_3  [6, 2, 0, 0]     C
3  user_4  [5, 0, 2, 2]     A
4  user_5  [3, 8, 0, 0]     B
5  user_6  [6, 0, 0, 2]     B

Then I calculate the element-wise mean of the values in each group, i.e. the cluster centroid, and store the result in a second dataframe (df1).

df1 = (df.groupby('group', as_index=False)['values']
         .agg(lambda x: np.vstack(x).mean(0).round(2))
       )
df1

Output:

  group                    values
0     A      [3.0, 4.0, 1.0, 2.0]
1     B  [3.33, 2.67, 0.67, 0.67]
2     C      [6.0, 2.0, 0.0, 0.0]
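
As a sanity check on the centroid values, here is a minimal sketch that recomputes the centroid of group B by hand. The array `group_b` is a hypothetical name; it simply stacks the three rows that belong to group B in the data above (user_1, user_5, user_6).

```python
import numpy as np

# The three 'values' rows of group B (user_1, user_5, user_6)
group_b = np.array([[1, 0, 2, 0],
                    [3, 8, 0, 0],
                    [6, 0, 0, 2]])

# Averaging down the rows (axis=0) gives one mean per position,
# which is what np.vstack(x).mean(0) does inside the groupby
centroid_b = group_b.mean(axis=0).round(2)
print(centroid_b)
```

The result should agree with the B row of df1 above.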

Finally, I compute the Euclidean distance from each user to every cluster centroid with the following code.

for value in df['values']:
    distance_values = []
    for centroid in df1['values']:
        distance_values.append(distance.euclidean(value, centroid))
    print(distance_values)

Output:

[5.0, 3.8439042651970405, 5.744562646538029]
[4.58257569495584, 6.004631545732011, 8.06225774829855]
[4.242640687119285, 2.9112883745860696, 0.0]
[4.58257569495584, 3.668187563361503, 3.605551275463989]
[4.58257569495584, 5.4236150305861495, 6.708203932499369]
[5.0990195135927845, 4.059014658756482, 2.8284271247461903]

So, for each user, I now have the distance to the centroid of each cluster. For example:
For user_1 the distances to the clusters are A=5.0, B=3.8439042651970405, and C=5.744562646538029.
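
To make the first of those numbers concrete, the following sketch recomputes the distance from user_1 to centroid A directly from the Euclidean formula. The variable names are illustrative only; the vectors are copied from df and df1.

```python
import numpy as np

user_1 = np.array([1, 0, 2, 0])              # 'values' of user_1
centroid_a = np.array([3.0, 4.0, 1.0, 2.0])  # centroid of group A from df1

# Euclidean distance: square root of the summed squared differences
# differences: [-2, -4, 1, -2] -> squares: 4 + 16 + 1 + 4 = 25
dist = np.sqrt(((user_1 - centroid_a) ** 2).sum())
print(dist)  # 5.0
```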

How do I return the maximum value of each row in distance values with its cluster name in the dataframe?

For example, the expected output is:

     user           max_value group
0  user_1   5.744562646538029     C
1  user_2    8.06225774829855     C
2  user_3   4.242640687119285     A
3  user_4    4.58257569495584     A
4  user_5   6.708203932499369     C
5  user_6  5.0990195135927845     A

CodePudding user response:

You can use apply to extract each row's maximum together with its position in the list, then map that position to a cluster name:

df['distance_values'] = [[5.0, 3.8439042651970405, 5.744562646538029],
[4.58257569495584, 6.004631545732011, 8.06225774829855],
[4.242640687119285, 2.9112883745860696, 0.0],
[4.58257569495584, 3.668187563361503, 3.605551275463989],
[4.58257569495584, 5.4236150305861495, 6.708203932499369],
[5.0990195135927845, 4.059014658756482, 2.8284271247461903]]  

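# For each row, keep the largest distance and its position in the list
max_df = df['distance_values'].apply(lambda x: [max(x), x.index(max(x))])
df['max_value'] = max_df.str[0]
# Positions 0, 1, 2 correspond to clusters A, B, C (the row order of df1);
# note that this overwrites the original 'group' column
df['group'] = max_df.str[1].map(dict(enumerate('ABC')))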
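
A possible variation, sketched here under the assumption that the distance columns are in the same order as the rows of df1 (A, B, C): put the distances into a labeled DataFrame and let `idxmax` return the column name directly, so no position-to-name mapping is needed. The names `dist_df` and `result` are hypothetical.

```python
import pandas as pd

users = ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6']
# Distances copied from the question: one row per user, one column per centroid
distances = [[5.0, 3.8439042651970405, 5.744562646538029],
             [4.58257569495584, 6.004631545732011, 8.06225774829855],
             [4.242640687119285, 2.9112883745860696, 0.0],
             [4.58257569495584, 3.668187563361503, 3.605551275463989],
             [4.58257569495584, 5.4236150305861495, 6.708203932499369],
             [5.0990195135927845, 4.059014658756482, 2.8284271247461903]]

# Assumption: column order matches the row order of df1 (A, B, C)
dist_df = pd.DataFrame(distances, index=users, columns=['A', 'B', 'C'])

result = pd.DataFrame({
    'max_value': dist_df.max(axis=1),   # largest distance in each row
    'group': dist_df.idxmax(axis=1),    # column label where that maximum occurs
}).rename_axis('user').reset_index()
print(result)
```

If a row contains a tie, `idxmax` reports only the first matching column.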

CodePudding user response:

max_dist = []          # largest distance for each user
distant_cluster = []   # position of the farthest centroid in df1

for value in df['values']:
    distance_values = []

    for centroid in df1['values']:
        distance_values.append(distance.euclidean(value, centroid))

    max_dist.append(max(distance_values))
    distant_cluster.append(distance_values.index(max(distance_values)))

# Positions follow the row order of df1: 0 -> A, 1 -> B, 2 -> C
cluster_map = {0: 'A', 1: 'B', 2: 'C'}
max_group = [cluster_map[i] for i in distant_cluster]

Then you can build the result dataframe:


pd.DataFrame(data={'user': df.user,
                   'max_value': max_dist,
                   'group': max_group})

   user     max_value    group
0  user_1   5.744563     C
1  user_2   8.062258     C
2  user_3   4.242641     A
3  user_4   4.582576     A
4  user_5   6.708204     C
5  user_6   5.099020     A
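
For larger inputs, the nested Python loops could be swapped for a single call to `scipy.spatial.distance.cdist`, which computes all pairwise distances at once. The following is only a sketch under the assumptions that every list in `values` has the same length and that the centroids are rounded to two decimals as in the question; `X`, `centroids`, `D` and `result` are hypothetical names.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'],
                   'values': [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0],
                              [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
                   'group': ['B', 'A', 'C', 'A', 'B', 'B']})

# Stack the lists into a 2D array: one row per user, one column per feature
X = np.vstack(df['values'])

# Centroid per group, rounded as in the question; the index is the sorted group names
centroids = pd.DataFrame(X).groupby(df['group'].to_numpy()).mean().round(2)

# D[i, j] is the Euclidean distance from user i to centroid j
D = cdist(X, centroids.to_numpy())

result = pd.DataFrame({'user': df['user'],
                       'max_value': D.max(axis=1),
                       # argmax gives the column position; look up its group name
                       'group': centroids.index[D.argmax(axis=1)].to_numpy()})
print(result)
```

Because the group names come from the index of `centroids`, no hard-coded position-to-name mapping is required.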

CodePudding user response:

You can also move the Euclidean distance calculation into the function you apply, so the whole computation happens in one step:

def calc_max_dist(value):
    # Distance from this user's vector to every centroid in df1
    dist_series = df1['values'].apply(lambda x: distance.euclidean(value, x))
    # Largest distance plus the group(s) that attain it (ties return several)
    return dist_series.max(), df1[dist_series == dist_series.max()]['group'].values

df[['max_value', 'farthest_group(s)']] = pd.DataFrame(df['values'].apply(calc_max_dist).tolist())

Output:

     user        values group  max_value farthest_group(s)
0  user_1  [1, 0, 2, 0]     B   5.744563               [C]
1  user_2  [1, 8, 0, 2]     A   8.062258               [C]
2  user_3  [6, 2, 0, 0]     C   4.242641               [A]
3  user_4  [5, 0, 2, 2]     A   4.582576               [A]
4  user_5  [3, 8, 0, 0]     B   6.708204               [C]
5  user_6  [6, 0, 0, 2]     B   5.099020               [A]