calculate the score with respect to the group it was assigned or other-CodePudding

I'm beginner in python, I have two dataframes as below. The first dataframe represents the user with their vectors and group number.

    df1 = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4', 'user 5'], 'x1': [[0.2, 0.3, 0.5],[0.3, 0.3, 0.4],[0.4, 0.4, 0.2],[0.2, 0.1, 0.7],[0.5,0.3,0.2]],'group': [1, 0, 0, 2, 1]})

df1

output:

 user   x1  group
    0   user 1  [0.2, 0.3, 0.5] 1
    1   user 2  [0.3, 0.3, 0.4] 0
    2   user 3  [0.4, 0.4, 0.2] 0
    3   user 4  [0.2, 0.1, 0.7] 2
    4   user 5  [0.5, 0.3, 0.2] 1

the second dataframe represents the group number with its vector and variable (p2) and its threshold

df2 = pd.DataFrame({'group': [0, 1, 2],
                   'x2': [[0.4, 0.2, 0.4],[0.5, 0.1, 0.4], [0.5, 0.1, 0.4]],
                   'p2': [0.231, 0.342, 0.411],
                   'threshold': [0.9, 0.6, 0.8]})
df2

output:

group   x2  p2  threshold
0   0   [0.4, 0.2, 0.4] 0.231   0.9
1   1   [0.5, 0.1, 0.4] 0.342   0.6
2   2   [0.5, 0.1, 0.4] 0.411   0.8

I am trying to calculate for each user, the score (S) with respect to the group it was assigned by using:

where k= group size and T is the transport matrix of (x2 -x1).

Then check:

1- If the score is below the threshold of its group, the user does not change its group.

2- If the score is higher than the threshold then we calculate the score for the other groups and will assign the user to the group for each the score is below the threshold. In case this is true for more than one group we will assign the user to the group for which the score is lowest.

3- If the score is above the threshold for all the groups then this user will give the start of a new group.

For example, for user 1 that belongs to group 1.

x2 =(0.5, 0.1, 0.4)
x1 =(0.2, 0.3, 0.5)
So x2 -x1= (0.3, -0.2, -0.1)

then transport of this vector is

(0.3,
-0.2,
-0.1)

so multiplying the transport of this vector by (x2 -x1) is equal to:

 (0.9  0.4 0.1)= 0.14

K= 2
p2 of its cluster=0.342

The score (S) for user 1:

= 1/2 (0.14/0.342)= 0.5 ( 0.4093)= 0.90

We can see that score of user 1 is higher than its group threshold (0.6) So, we need to calculate the score for the other groups (0 and 2) and will assign the user to the group for each score below the threshold. How could I do that for all users?

CodePudding user response：

First, count up the members of each group to get the k term:

df2['count'] = df1.groupby('group')['user'].count()

Then merge df1 and df2 so that we have a frame with all necessary parameters for each user in a single row:

joined = df1.join(df2[['x2', 'p2', 'threshold', 'count']], on='group')
print(joined)

>>>      user               x1  group               x2     p2  threshold  count
0  user 1  [0.2, 0.3, 0.5]      1  [0.5, 0.1, 0.4]  0.342        0.6      2
1  user 2  [0.3, 0.3, 0.4]      0  [0.4, 0.2, 0.4]  0.231        0.9      2
2  user 3  [0.4, 0.4, 0.2]      0  [0.4, 0.2, 0.4]  0.231        0.9      2
3  user 4  [0.2, 0.1, 0.7]      2  [0.5, 0.1, 0.4]  0.411        0.8      1
4  user 5  [0.5, 0.3, 0.2]      1  [0.5, 0.1, 0.4]  0.342        0.6      2

Now define functions to calculate the S score:

def l_delta(z1, z2):
    return [a1 - a2 for (a1, a2) in zip(z1, z2)]

def inner(z1, z2):
    return sum([a1 * a2 for (a1, a2) in zip(z1, z2)])

def s_score(row):
    delta = l_delta(row['x2'], row['x1'])
    num = inner(delta, delta)
    return 1/row['count']   num / row['p2']

Finally, apply these functions to each row in the joined matrix:

joined['s_score'] = joined.apply(s_score, axis=1)
print(joined[['user', 's_score']])

Result:

     user   s_score
0  user 1  0.909357
1  user 2  0.586580
2  user 3  0.846320
3  user 4  1.437956
4  user 5  0.733918

CodePudding user response：

Similar answer to @The Photon, where we (1) merge df1 and df2, (2) calculate k with groupby (3) calculate (x2-x1) inner product with itself

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4', 'user 5'],
                    'x1': [[0.2, 0.3, 0.5],[0.3, 0.3, 0.4],[0.4, 0.4, 0.2],[0.2, 0.1, 0.7],[0.5,0.3,0.2]],
                    'group': [1, 0, 0, 2, 1]})

df2 = pd.DataFrame({'group': [0, 1, 2],
                    'x2': [[0.4, 0.2, 0.4],[0.5, 0.1, 0.4], [0.5, 0.1, 0.4]],
                    'p2': [0.231, 0.342, 0.411],
                    'threshold': [0.9, 0.6, 0.8]})

#merge df1 and df2 into a single table
merged_df = df1.merge(df2)

#calculate the number of unique users per group (k)
merged_df['k'] = merged_df.groupby('group')['user'].transform('nunique')

#calculate x2-x1 for each user (convert to numpy array for vectorized subtraction)
x2_sub_x1 = merged_df['x2'].apply(np.array)-merged_df['x1'].apply(np.array)

#calculate (x2-x1)T(x2-x1) for each user (same as squaring each term and summing)
numerator = x2_sub_x1.pow(2).apply(sum)

#calculate S from your formula and add it as a column to the merged table
merged_df['S'] = (1/merged_df['k']) (numerator/merged_df['p2'])

Final merged table

    user    x1  group   x2  p2  threshold   k   S
0   user 1  [0.2, 0.3, 0.5] 1   [0.5, 0.1, 0.4] 0.342   0.6 2   0.909357
1   user 5  [0.5, 0.3, 0.2] 1   [0.5, 0.1, 0.4] 0.342   0.6 2   0.733918
2   user 2  [0.3, 0.3, 0.4] 0   [0.4, 0.2, 0.4] 0.231   0.9 2   0.586580
3   user 3  [0.4, 0.4, 0.2] 0   [0.4, 0.2, 0.4] 0.231   0.9 2   0.846320
4   user 4  [0.2, 0.1, 0.7] 2   [0.5, 0.1, 0.4] 0.411   0.8 1   1.437956