Background -
I have a pandas DataFrame, containing data that I would like to do some calculation on. Specifically, I need to iterate through every row of the DataFrame and compute a calculation, which if in Excel, would be =SUMIF
.
Calculation (=SUMIFS([% Ownership],[Entity ID],[@[Entity ID]])
) -
In this calculation, the integer value from the Entity ID
column is taken, and for every instance of that Entity ID
located in the Entity ID
column throughout the DataFrame, it's corresponding % Ownership
value is summed and printed in the Entity ID % Column
.
>>> df
Entity ID % Ownership Entity ID %
0 12345 100.00 1.000000
1 45643 49.56 0.495551
2 00000 100.00 2.000000
3 00000 100.00 2.000000
In the above example:
12345
only appears once in the DataFrame, therefore itsEntity ID
value is 1.000000.45643
only appears once in the DataFrame (with a 49.56 value), therefore itsEntity ID
value is 0.495551.00000
appears twice in the DataFrame (both instance with 1.0 values) therefore itsEntity ID
value is 2.000000.
Useful info:
- You will note that the data formatting in the
% Ownership
andEntity ID %
columns is different (e.g., 100.00 vs 1.00000), this is how the cells are formatted in my Excel workbook, and I would ideally like to use this formatting and decimal rounding if possible. - I would like to understand how to iterate through every row in my DataFrame (it's 4000 rows) and perform such a calculation using a for loop, so to effectively perform this calculation for every row.
CodePudding user response:
We can use a groupby
to get the sum and assign
the result directly into the DataFrame like so :
>>> df.assign(Entity_ID_perc = df.groupby('Entity ID')['% Ownership'].sum() / 100)
Entity ID % Ownership Entity_ID_perc
0 12345 100.00 1.0000
1 45643 49.56 0.4956
2 00000 100.00 2.0000
3 00000 100.00 2.0000