Apply a Function per Row of Sub Groups of the Data **Above** the Current Row


Assume I have data in the following form (as a Pandas DataFrame):

Index  ID  Value  Div Factor  Weighted Sum
    1   1      2           1
    2   1      3           2
    3   2      6           1
    4   1      1           3
    5   2      3           2
    6   2      9           3
    7   2      8           4
    8   3      5           1
    9   3      6           2
   10   1      8           4
   11   3      2           3
   12   3      7           4

I want to calculate the column Weighted Sum as follows (for the $i$-th row):

  1. Look at all values from row 1 to row i.
  2. Sum the values in groups according to each row's ID. This yields k sums, where k is the number of unique ID values in rows 1 to i.
  3. Divide each of the k sums by the number of elements in its group.
  4. Sum those k per-group averages and divide by k (the average of the averages).
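As a baseline, the steps above can be sketched as a single pass that incrementally maintains each group's running sum and count, plus the running total of the per-group averages (sample data taken from the question; this is the "loop" approach, written so each row costs O(1)):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 1, 2, 1, 2, 2, 2, 3, 3, 1, 3, 3],
    "Value": [2, 3, 6, 1, 3, 9, 8, 5, 6, 8, 2, 7],
})

# Per-group running sum and count, and the running total of all
# per-group averages seen so far; updating the total incrementally
# avoids re-averaging every group on every row.
sums, counts = {}, {}
avg_total = 0.0
out = []
for gid, val in zip(df["ID"], df["Value"]):
    old_avg = sums[gid] / counts[gid] if gid in sums else 0.0
    sums[gid] = sums.get(gid, 0) + val
    counts[gid] = counts.get(gid, 0) + 1
    avg_total += sums[gid] / counts[gid] - old_avg
    out.append(avg_total / len(sums))  # mean over groups seen so far

df["Weighted Sum"] = out
```

This reproduces the worked examples below (2.0 for row 1, 4.25 for row 7, 5.0 for row 12), but being a pure-Python loop it is the slow reference point the question wants to avoid.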

For example, let's do rows 1, 7 and 12:

Row 1

For i = 1 there is a single value, so the sum is 2, the average of the single group is 2, and the average over all groups is 2.

Row 7

For i = 7 there are only 2 unique ID values in rows 1 to 7: 1 and 2.
For the group ID = 1: (1 + 3 + 2) / 3 = 2.
For the group ID = 2: (8 + 9 + 3 + 6) / 4 = 6.5.
The average of the averages is (2 + 6.5) / 2 = 4.25.

Row 12

For i = 12 there are 3 unique ID values in rows 1 to 12.
For the group ID = 1: (8 + 1 + 3 + 2) / 4 = 3.5.
For the group ID = 2: (8 + 9 + 3 + 6) / 4 = 6.5.
For the group ID = 3: (7 + 2 + 6 + 5) / 4 = 5.
The average of the averages is (3.5 + 6.5 + 5) / 3 = 5.
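The row-12 arithmetic can be checked directly:

```python
# Per-group averages over rows 1..12 (values from the table above)
avg_id1 = (8 + 1 + 3 + 2) / 4   # 3.5
avg_id2 = (8 + 9 + 3 + 6) / 4   # 6.5
avg_id3 = (7 + 2 + 6 + 5) / 4   # 5.0

# Average of the k = 3 group averages
weighted_sum_12 = (avg_id1 + avg_id2 + avg_id3) / 3
print(weighted_sum_12)  # 5.0
```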

This is easy to do with a loop, but for a large number of rows it is inefficient. Is there an efficient way to do it, perhaps using apply() or transform()?

Remark: The method should be feasible for ~1e7 rows and ~1e6 unique IDs.

CodePudding user response:

This should do what you're asking:

df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
df['Weighted Sum'] = df2.mean(axis=1)

(Simplification of the last line based on comment by @Acccumulation.)
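For reference, here is the same three-line approach as a self-contained, commented snippet (sample data reconstructed from the question's table):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 1, 2, 1, 2, 2, 2, 3, 3, 1, 3, 3],
    "Value": [2, 3, 6, 1, 3, 9, 8, 5, 6, 8, 2, 7],
})

# Pivot: each unique ID becomes a column; each row's Value lands in the
# column matching its ID, with NaN everywhere else.
df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)

# Cumulative per-group average = cumulative sum / cumulative count.
# Where a group has not appeared yet, 0/0 yields NaN.
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()

# Row-wise mean over the groups seen so far (NaN columns are skipped).
df['Weighted Sum'] = df2.mean(axis=1)
```

Note that the pivot materializes a rows × unique-IDs matrix, so memory is worth watching at the scale mentioned in the question's remark.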

Output:

    Index  ID  Value  Div Factor  Weighted Sum
0       1   1      2           1      2.000000
1       2   1      3           2      2.500000
2       3   2      6           1      4.250000
3       4   1      1           3      4.000000
4       5   2      3           2      3.250000
5       6   2      9           3      4.000000
6       7   2      8           4      4.250000
7       8   3      5           1      4.500000
8       9   3      6           2      4.666667
9      10   1      8           4      5.166667
10     11   3      2           3      4.777778
11     12   3      7           4      5.000000

Explanation:

  • Use unstack() to pivot: each unique ID value becomes a column label, each row's Value lands in the column matching its ID, and every other location holds NaN
  • Use fillna(0) and cumsum() to replace each column with its cumulative sum
  • Use notnull().astype(int) and cumsum() to replace each column with its cumulative count (where NaNs count as 0)
  • Take the ratio of the cumulative sum to the cumulative count; this is the cumulative average at each successive row of Value for each unique ID
  • For each row, take the mean of the non-NaN cumulative averages across all unique ID values. This is the result described in the question.

Detailed intermediate results:

    Index  ID  Value  Div Factor
0       1   1      2           1
1       2   1      3           2
2       3   2      6           1
3       4   1      1           3
4       5   2      3           2
5       6   2      9           3
6       7   2      8           4
7       8   3      5           1
8       9   3      6           2
9      10   1      8           4
10     11   3      2           3
11     12   3      7           4
df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
   Value
ID     1    2    3
0    2.0  NaN  NaN
1    3.0  NaN  NaN
2    NaN  6.0  NaN
3    1.0  NaN  NaN
4    NaN  3.0  NaN
5    NaN  9.0  NaN
6    NaN  8.0  NaN
7    NaN  NaN  5.0
8    NaN  NaN  6.0
9    8.0  NaN  NaN
10   NaN  NaN  2.0
11   NaN  NaN  7.0
df1.fillna(0).cumsum()
   Value
ID     1     2     3
0    2.0   0.0   0.0
1    5.0   0.0   0.0
2    5.0   6.0   0.0
3    6.0   6.0   0.0
4    6.0   9.0   0.0
5    6.0  18.0   0.0
6    6.0  26.0   0.0
7    6.0  26.0   5.0
8    6.0  26.0  11.0
9   14.0  26.0  11.0
10  14.0  26.0  13.0
11  14.0  26.0  20.0
df1.notnull().astype(int).cumsum()
   Value
ID     1  2  3
0      1  0  0
1      2  0  0
2      2  1  0
3      3  1  0
4      3  2  0
5      3  3  0
6      3  4  0
7      3  4  1
8      3  4  2
9      4  4  2
10     4  4  3
11     4  4  4
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
   Value
ID     1    2         3
0    2.0  NaN       NaN
1    2.5  NaN       NaN
2    2.5  6.0       NaN
3    2.0  6.0       NaN
4    2.0  4.5       NaN
5    2.0  6.0       NaN
6    2.0  6.5       NaN
7    2.0  6.5  5.000000
8    2.0  6.5  5.500000
9    3.5  6.5  5.500000
10   3.5  6.5  4.333333
11   3.5  6.5  5.000000
result with df['Weighted Sum'] = df2.mean(axis=1)
    Index  ID  Value  Div Factor  Weighted Sum
0       1   1      2           1      2.000000
1       2   1      3           2      2.500000
2       3   2      6           1      4.250000
3       4   1      1           3      4.000000
4       5   2      3           2      3.250000
5       6   2      9           3      4.000000
6       7   2      8           4      4.250000
7       8   3      5           1      4.500000
8       9   3      6           2      4.666667
9      10   1      8           4      5.166667
10     11   3      2           3      4.777778
11     12   3      7           4      5.000000