I am testing the following simple example (see the comments in the code below for background). I have two questions. Thanks.
- How come b in bottle is not updated even though the for loop did calculate the right value?
- Is there an easier way to do this without using a for loop? I heard that loops can take a long time to run when the data is bigger than this simple example.
import pandas as pd

test = pd.DataFrame(
    [[1, 5], [1, 8], [1, 9], [2, 1], [3, 1], [4, 1]],
    columns=['a', 'b']
)  # Original df
bottle = pd.DataFrame().reindex_like(test)  # a blank df with the same shape
bottle['a'] = test['a']  # set 'a' in bottle to be the same in test
print(bottle)

   a   b
0  1 NaN
1  1 NaN
2  1 NaN
3  2 NaN
4  3 NaN
5  4 NaN

for index, row in bottle.iterrows():
    row['b'] = test[test['a'] == row['a']]['b'].sum()
    print(row['a'], row['b'])

1.0 22.0
1.0 22.0
1.0 22.0
2.0 1.0
3.0 1.0
4.0 1.0

# I can see for loop is doing what I need.

bottle

   a   b
0  1 NaN
1  1 NaN
2  1 NaN
3  2 NaN
4  3 NaN
5  4 NaN

# However, 'b' in bottle is not updated by the for loop. Why? And how to fix that?

test['c'] = bottle['b']
# This is the end output I want to get, but not working due to the above.
# Also is there a way to achieve this without using for loop?
CodePudding user response:
When you iterate over the DataFrame's rows with iterrows, the row variable here is a new Series holding a copy of the current row's values, not a view into bottle. Assigning to row["b"] changes only that copy, and the copy is discarded when the loop moves on to the next row. If you want your for loop to work, you should assign to bottle.loc[index, "b"] instead of to row["b"].
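As a minimal sketch of the difference, using the same data as the question (the variable name unchanged is only for illustration), the first loop below writes to row and leaves bottle untouched, while the second writes through bottle.loc:

```python
import pandas as pd

test = pd.DataFrame(
    [[1, 5], [1, 8], [1, 9], [2, 1], [3, 1], [4, 1]],
    columns=["a", "b"],
)
bottle = pd.DataFrame().reindex_like(test)
bottle["a"] = test["a"]

# Writing to `row` only changes the per-iteration Series, not `bottle`.
for index, row in bottle.iterrows():
    row["b"] = test.loc[test["a"] == row["a"], "b"].sum()
unchanged = bottle["b"].isna().all()  # 'b' is still all NaN

# Writing through .loc targets the DataFrame itself.
for index, row in bottle.iterrows():
    bottle.loc[index, "b"] = test.loc[test["a"] == row["a"], "b"].sum()

print(unchanged)
print(bottle)
```

This still filters test once per row, so it only fixes the assignment and does not make the loop any faster.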
You can complete your task without a for loop by using pandas.DataFrame.groupby and transform as follows:
bottle["b"] = test.groupby("a")["b"].transform("sum")
bottle:

   a   b
0  1  22
1  1  22
2  1  22
3  2   1
4  3   1
5  4   1
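Since the question's end goal is a column c on test, the same call can probably be assigned to test directly, skipping the intermediate bottle frame. This is a sketch under that assumption, not necessarily how the asker needs to structure it:

```python
import pandas as pd

test = pd.DataFrame(
    [[1, 5], [1, 8], [1, 9], [2, 1], [3, 1], [4, 1]],
    columns=["a", "b"],
)

# transform("sum") returns a Series aligned to test's index,
# so every row receives the total of 'b' for its own 'a' group.
test["c"] = test.groupby("a")["b"].transform("sum")

print(test)
```

Because transform keeps the original index, the result lines up row by row with test and no merge or loop is needed.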
CodePudding user response:
The value of b in bottle is not updated because the loop never assigns to bottle itself. Instead, it only assigns to row, which is a per-iteration copy of the current row.
To fix this, you can modify the code as follows:
for index, row in bottle.iterrows():
    bottle.loc[index, 'b'] = test[test['a'] == row['a']]['b'].sum()
This will update the value of b in the bottle DataFrame for the current row in the loop.