I just discovered that iterating the rows of a pandas dataframe, and making updates to each row, does not update the dataframe! Is this expected behaviour, or does one need to do something to the row first so the update reflects in the parent dataframe?
(I know one could update the dataframe directly in the loop. My question is about the fact that iterrows() seems to provide copies of the rows rather than references to the actual rows in the dataframe, which seems an odd way to do this.)
import pandas as pd

fruit = {
    "Fruit": ['Apple', 'Avacado', 'Banana', 'Strawberry', 'Grape'],
    "Color": ['Red', 'Green', 'Yellow', 'Pink', 'Green'],
    "Price": [45, 90, 60, 37, 49],
}
df = pd.DataFrame(fruit)

for index, row in df.iterrows():
    row['Price'] = row['Price'] * 2
    print(row['Price'])  # the price is doubled here as expected

print(df['Price'])  # the original values of price in the dataframe are unchanged
CodePudding user response:
You are storing the change in row['Price'] but never writing it back to the dataframe df. The row yielded by iterrows() is a separate Series object, not df itself; you can check this with:
id(row) == id(df)
which returns False. Also, for better efficiency you shouldn't loop at all; simply assign the whole column at once. Replace the for loop with:
df['New Price'] = df['Price'] * 2
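As a minimal sketch of that point (using a shortened, made-up version of the fruit table; the names unchanged and doubled are only illustrative), the loop leaves df alone while the column assignment does the work in one step:

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Grape"],
    "Price": [45, 60, 49],
})

# Mutating the row yielded by iterrows() does not touch df here:
# the frame mixes text and integer columns, so each row is a newly built Series.
for index, row in df.iterrows():
    row["Price"] = row["Price"] * 2

unchanged = df["Price"].tolist()  # still the original prices

# Vectorized assignment computes the whole new column at once.
df["New Price"] = df["Price"] * 2
doubled = df["New Price"].tolist()

print(unchanged, doubled)
```

Writing to a new column keeps the original prices around; assign to df['Price'] instead if you want to overwrite them.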
CodePudding user response:
You are entering the subtleties of copies versus original objects. What you update in the loop is a copy of the row, not the row stored in the DataFrame.
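One way to see why the row is a copy here, sketched on a small made-up frame (column_dtype and row_dtype are just illustrative names): with mixed column types, iterrows() has to build each row as a new Series with a common dtype, so it cannot be a view onto the integer Price column.

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Grape"], "Price": [45, 49]})

# Take the first (index, row) pair that iterrows() yields.
index, row = next(df.iterrows())

column_dtype = df["Price"].dtype  # an integer dtype inside the DataFrame
row_dtype = row.dtype             # object: text and numbers share one Series

print(column_dtype, row_dtype)
```

The pandas documentation also warns that, depending on the data types, iterrows() may return a copy rather than a view, so writing to the row should not be relied on either way.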
Instead, write to the DataFrame directly:
for index, row in df.iterrows():
    df.loc[index, 'Price'] = row['Price'] * 2
But the idiomatic way to perform such operations is a vectorized one:
df['Price'] = df['Price'].mul(2)
Or:
df['Price'] *= 2
Output:
        Fruit   Color  Price
0       Apple     Red     90
1     Avacado   Green    180
2      Banana  Yellow    120
3  Strawberry    Pink     74
4       Grape   Green     98
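To check that the two approaches agree, here is a small sketch that runs the df.loc loop and the vectorized update on separate copies of the question's data (the names looped, vectorized and same are only illustrative):

```python
import pandas as pd

fruit = {
    "Fruit": ["Apple", "Avacado", "Banana", "Strawberry", "Grape"],
    "Color": ["Red", "Green", "Yellow", "Pink", "Green"],
    "Price": [45, 90, 60, 37, 49],
}

# Row-by-row: read from the yielded row, write back through df.loc.
looped = pd.DataFrame(fruit)
for index, row in looped.iterrows():
    looped.loc[index, "Price"] = row["Price"] * 2

# Vectorized: one operation on the whole column.
vectorized = pd.DataFrame(fruit)
vectorized["Price"] *= 2

same = looped["Price"].tolist() == vectorized["Price"].tolist()
print(same)
```

Both give the same prices; the vectorized form is shorter and avoids building a Series for every row.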