calculate the average of each dimension defining the group in python-CodePudding

I have a dataframe (df) that has three columns (user, vector, and group name), the vector column with multiple comma-separated values in each row.

df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5',  'user_6'], 'vector': [[1, 0, 2, 0], [1, 8, 0, 2],[6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0],[6, 0, 0, 2]], 'group': ['A', 'B', 'C', 'B', 'A', 'A']})

I would like to calculate for each group, the sum of dimensions in all rows divided by the total number of rows for this group.

For example: For group, A is [(1 3 6)/3, (0 8 0)/3, (2 0 0)/3, (0 0 2)/3] = [3.3, 2.6, 0.6, 0.6].

For group, B is [(1 5)/2, (8 0)/2, (0 2)/2, (2 2)/2] = [3,4,1,2].

For group, C is [6, 2, 0, 0]

So, the expected result is an array:

group A: [3.3, 2.6, 0.6, 0.6]
group B: [3,4,1,2]
group C: [6, 2, 0, 0]

CodePudding user response：

I'm not sure if you were looking for the results stored in a single array/dataframe, or if you're just looking to get the results as separate arrays.

If the latter, something like this should work for you:

for group in df.group.unique():
    print(f'Group {group} results: ')
    tmp_df = pd.DataFrame(df[df.group==group]['vector'].tolist())
    print(tmp_df.mean().values)

Output:

Group A results: 
[3.33333333 2.66666667 0.66666667 0.66666667]
Group B results: 
[3. 4. 1. 2.]
Group C results: 
[6. 2. 0. 0.]

It's a little clunky, but gets the job done if you're just looking to get the results.

Filters the dataframe based on group, then turns the vectors of that into it's own tmp_df and gets the mean for each column.

If you want you could easily take those arrays and save them for further manipulation or what have you.

Hope that helps!

CodePudding user response：

Take advantage of numpy:

import numpy as np

out = (df.groupby('group')['vector']
         .agg(lambda x: np.vstack(x).mean(0).round(2))
       )

print(out)

Output:

group
A    [3.33, 2.67, 0.67, 0.67]
B        [3.0, 4.0, 1.0, 2.0]
C        [6.0, 2.0, 0.0, 0.0]
Name: vector, dtype: object

as DataFrame

out = (df.groupby('group', as_index=False)['vector']
         .agg(lambda x: np.vstack(x).mean(0).round(2))
       )

Output:

  group                    vector
0     A  [3.33, 2.67, 0.67, 0.67]
1     B      [3.0, 4.0, 1.0, 2.0]
2     C      [6.0, 2.0, 0.0, 0.0]

as array

out = np.vstack(df.groupby('group')['vector']
                  .agg(lambda x: np.vstack(x).mean(0).round(2))
                )

Output:

[[3.33 2.67 0.67 0.67]
 [3.   4.   1.   2.  ]
 [6.   2.   0.   0.  ]]

CodePudding user response：

So, the expected result is an array:

group A: [3.3, 2.6, 0.6, 0.6]
group B: [3,4,1,2]
group C: [6, 2, 0, 0]