Assume I have data in the following form (as a pandas DataFrame):
| Index | ID | Value | Div Factor | Weighted Sum |
| --- | --- | --- | --- | --- |
| 1 | 1 | 2 | 1 | |
| 2 | 1 | 3 | 2 | |
| 3 | 2 | 6 | 1 | |
| 4 | 1 | 1 | 3 | |
| 5 | 2 | 3 | 2 | |
| 6 | 2 | 9 | 3 | |
| 7 | 2 | 8 | 4 | |
| 8 | 3 | 5 | 1 | |
| 9 | 3 | 6 | 2 | |
| 10 | 1 | 8 | 4 | |
| 11 | 3 | 2 | 3 | |
| 12 | 3 | 7 | 4 | |
I want to calculate the column `Weighted Sum` as follows (for the $i$-th row):

- Look at all values from row 1 to $i$.
- Sum the values in groups according to the `ID` value of each row. This gives $k$ sums, where $k$ is the number of unique `ID` values in rows 1 to $i$.
- Divide each of the $k$ sums by the number of elements in its group.
- Sum those $k$ group averages and divide by $k$ (the average of the averages); see the formula right after this list.
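In symbols (a compact restatement of the list above; the notation $G_i$ and $S_{i,g}$ is introduced here for illustration): with $G_i$ the set of unique `ID` values in rows $1,\dots,i$ and $S_{i,g} = \{\, j \le i : \mathrm{ID}_j = g \,\}$ the rows of group $g$ up to row $i$,

$$
\mathrm{WS}_i = \frac{1}{|G_i|} \sum_{g \in G_i} \frac{1}{|S_{i,g}|} \sum_{j \in S_{i,g}} \mathrm{Value}_j .
$$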
For example, let's do rows 1, 7 and 12:
Row 1

For $i = 1$ we have a single value, so the group sum is 2, the average of that single group is 2, and the average over all groups is 2.
Row 7

For $i = 7$ there are only 2 unique `ID` values in rows 1 to 7: 1 and 2.
For the group with `ID = 1` we have: (1 + 3 + 2) / 3 = 2.
For the group with `ID = 2` we have: (8 + 9 + 3 + 6) / 4 = 6.5.
Then the average of the averages is (2 + 6.5) / 2 = 4.25.
Row 12

For $i = 12$ there are 3 unique `ID` values in rows 1 to 12.
For the group with `ID = 1` we have: (8 + 1 + 3 + 2) / 4 = 3.5.
For the group with `ID = 2` we have: (8 + 9 + 3 + 6) / 4 = 6.5.
For the group with `ID = 3` we have: (7 + 2 + 6 + 5) / 4 = 5.
Then the average of the averages is (3.5 + 6.5 + 5) / 3 = 5.
It is easy to do with a loop, but for a large number of rows it is inefficient.
Is there an efficient way to do it? Maybe using `apply()` or `transform()`?

Remark: the method should remain feasible for ~1e7 rows and ~1e6 unique `ID` values.
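For reference, one such loop might look like this (a minimal sketch, assuming the data is in a DataFrame named `df` with the columns above; `weighted_sum_loop` is an illustrative name):

```python
import pandas as pd

def weighted_sum_loop(df):
    """Naive baseline: for every row i, recompute all group averages over rows 1..i."""
    out = []
    for i in range(len(df)):
        head = df.iloc[: i + 1]                           # rows 1..i (0-based slice)
        group_means = head.groupby('ID')['Value'].mean()  # one average per ID seen so far
        out.append(group_means.mean())                    # average of the group averages
    return out

# df['Weighted Sum'] = weighted_sum_loop(df)
```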
CodePudding user response:
This should do what you're asking:
```python
# Pivot so each unique ID becomes its own column (NaN where that ID is absent),
# then divide each column's cumulative sum by its cumulative count to get the
# running per-ID average; the row-wise mean (NaN-skipping) averages the averages.
df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
df['Weighted Sum'] = df2.mean(axis=1)
```
(Simplification of the last line based on comment by @Acccumulation.)
Output:
```
    Index  ID  Value  Div Factor  Weighted Sum
0       1   1      2           1      2.000000
1       2   1      3           2      2.500000
2       3   2      6           1      4.250000
3       4   1      1           3      4.000000
4       5   2      3           2      3.250000
5       6   2      9           3      4.000000
6       7   2      8           4      4.250000
7       8   3      5           1      4.500000
8       9   3      6           2      4.666667
9      10   1      8           4      5.166667
10     11   3      2           3      4.777778
11     12   3      7           4      5.000000
```
Explanation:
- Use `unstack()` to pivot, turning the unique `ID` values into a level of column labels, with each row's `Value` entry placed in the column matching its `ID` and NaN everywhere else.
- Use `fillna(0)` and `cumsum()` to replace each column with its cumulative sum.
- Use `notnull().astype(int)` and `cumsum()` to replace each column with its cumulative count (where NaNs count as 0).
- Take the ratio of the cumulative sum to the cumulative count; this is the cumulative average of `Value` at each successive row for each unique `ID`.
- For each row, take the mean of the non-NaN cumulative averages across all unique `ID` values. This is the result described in the question.
Detailed intermediate results:
`df`:
```
    Index  ID  Value  Div Factor
0       1   1      2           1
1       2   1      3           2
2       3   2      6           1
3       4   1      1           3
4       5   2      3           2
5       6   2      9           3
6       7   2      8           4
7       8   3      5           1
8       9   3      6           2
9      10   1      8           4
10     11   3      2           3
11     12   3      7           4
```
`df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)`:
```
   Value
ID     1    2    3
0    2.0  NaN  NaN
1    3.0  NaN  NaN
2    NaN  6.0  NaN
3    1.0  NaN  NaN
4    NaN  3.0  NaN
5    NaN  9.0  NaN
6    NaN  8.0  NaN
7    NaN  NaN  5.0
8    NaN  NaN  6.0
9    8.0  NaN  NaN
10   NaN  NaN  2.0
11   NaN  NaN  7.0
```
`df1.fillna(0).cumsum()`:
```
   Value
ID      1     2     3
0     2.0   0.0   0.0
1     5.0   0.0   0.0
2     5.0   6.0   0.0
3     6.0   6.0   0.0
4     6.0   9.0   0.0
5     6.0  18.0   0.0
6     6.0  26.0   0.0
7     6.0  26.0   5.0
8     6.0  26.0  11.0
9    14.0  26.0  11.0
10   14.0  26.0  13.0
11   14.0  26.0  20.0
```
`df1.notnull().astype(int).cumsum()`:
```
   Value
ID  1  2  3
0   1  0  0
1   2  0  0
2   2  1  0
3   3  1  0
4   3  2  0
5   3  3  0
6   3  4  0
7   3  4  1
8   3  4  2
9   4  4  2
10  4  4  3
11  4  4  4
```
`df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()`:
```
   Value
ID    1    2         3
0   2.0  NaN       NaN
1   2.5  NaN       NaN
2   2.5  6.0       NaN
3   2.0  6.0       NaN
4   2.0  4.5       NaN
5   2.0  6.0       NaN
6   2.0  6.5       NaN
7   2.0  6.5  5.000000
8   2.0  6.5  5.500000
9   3.5  6.5  5.500000
10  3.5  6.5  4.333333
11  3.5  6.5  5.000000
```
Result with `df['Weighted Sum'] = df2.mean(axis=1)`:
```
    Index  ID  Value  Div Factor  Weighted Sum
0       1   1      2           1      2.000000
1       2   1      3           2      2.500000
2       3   2      6           1      4.250000
3       4   1      1           3      4.000000
4       5   2      3           2      3.250000
5       6   2      9           3      4.000000
6       7   2      8           4      4.250000
7       8   3      5           1      4.500000
8       9   3      6           2      4.666667
9      10   1      8           4      5.166667
10     11   3      2           3      4.777778
11     12   3      7           4      5.000000
```
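One caveat on the question's Remark: the pivot above materializes a dense frame with one column per unique `ID`, so at ~1e7 rows and ~1e6 unique `ID`s that intermediate alone would be far too large for memory. If that regime matters, a single-pass variant (a minimal sketch, not the vectorized method above; `weighted_sum_streaming` is an illustrative name) keeps only a running sum and count per `ID`:

```python
import numpy as np

def weighted_sum_streaming(ids, values):
    """O(n) pass: keep per-ID running (sum, count) and the running total of
    per-group averages; each row changes exactly one group's average."""
    sums, counts = {}, {}   # ID -> running sum / running count of Value
    avg_total = 0.0         # sum of the current per-group averages
    out = np.empty(len(ids))
    for i, (g, v) in enumerate(zip(ids, values)):
        old_avg = sums[g] / counts[g] if g in counts else 0.0
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
        avg_total += sums[g] / counts[g] - old_avg  # swap in this group's new average
        out[i] = avg_total / len(counts)            # divide by k = groups seen so far
    return out

# df['Weighted Sum'] = weighted_sum_streaming(df['ID'].to_numpy(), df['Value'].to_numpy())
```

This is still a Python-level loop, so at ~1e7 rows it would likely need something like Cython or numba to be fast, but memory stays proportional to the number of unique `ID`s rather than rows times `ID`s.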