I have a Python script which does the following:
final_df = main_df
final_df['new_column'] = final_df['b']*0.05
Basically, I don't want to disturb the main_df dataframe; I want to work on a copy of it. But when I run the above script, both final_df and main_df are affected in the same way. Why does this happen?
How should I proceed to get the behavior I want?
CodePudding user response:
Use the copy method:
final_df = main_df.copy()
CodePudding user response:
You have to copy it; otherwise final_df would point to the same object as main_df. So it would be:
final_df = main_df.copy() # by default deep copy is done
See the pandas documentation for DataFrame.copy.
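To make the difference concrete, here is a minimal sketch. The column values are made up for illustration and stand in for the asker's main_df; only the column name b and the 0.05 factor come from the question.

```python
import pandas as pd

# Hypothetical data standing in for the asker's main_df
main_df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# Plain assignment: both names refer to the same DataFrame object
alias_df = main_df
print(alias_df is main_df)  # True

# .copy() (deep=True by default) creates an independent DataFrame
final_df = main_df.copy()
final_df["new_column"] = final_df["b"] * 0.05

print("new_column" in final_df.columns)  # True
print("new_column" in main_df.columns)   # False
```

Because final_df is an independent copy, adding new_column to it leaves main_df unchanged.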
CodePudding user response:
The reason is that when you do final_df = main_df, no copy of main_df is made; only a new reference, final_df, is created, so both final_df and main_df refer to the same object in memory. If one is changed, like final_df in your case, the change is also reflected in main_df, because both names point to the same object.
Example:
main_df = ['final', 'df']
final_df = main_df
print(f'Location of final_df: {id(final_df)}')
print(f'Location of main_df: {id(main_df)}')
Both of the above print statements will print the same memory location.
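As a small sketch of why this matters, the snippet below reuses the same list (standing in for the DataFrame, as in the example above) and mutates it through one name; the appended value is arbitrary.

```python
main_df = ['final', 'df']   # a list standing in for the DataFrame
final_df = main_df          # no copy: a second name for the same list

final_df.append('changed')  # mutate the object through one name

print(main_df)              # ['final', 'df', 'changed']
print(final_df is main_df)  # True
```

The change made through final_df is visible through main_df because there is only one object.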
Here is a nice writeup to understand this behavior.
If you don't want main_df to be affected, create a copy of it as below:
final_df = main_df.copy()
With copy(), a completely new object is created, which can be verified with the code below:
main_df = ['final', 'df']
final_df = main_df.copy()  # list.copy() is a built-in method; no import is needed
print(f'Location of final_df: {id(final_df)}')
print(f'Location of main_df: {id(main_df)}')
Now the two print statements will print two different memory locations.
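One caveat with the list analogy: list.copy() is a shallow copy, whereas DataFrame.copy() copies the data by default (deep=True). The sketch below uses a hypothetical nested list to show the difference, with copy.deepcopy from the standard library as the deep variant.

```python
import copy

# Hypothetical nested list, used only to illustrate shallow vs. deep copies
main = [['final'], ['df']]

shallow = main.copy()        # new outer list, but the inner lists are shared
deep = copy.deepcopy(main)   # new outer list and new inner lists

shallow[0].append('changed')  # mutates an inner list that main also holds

print(main[0])   # ['final', 'changed']
print(deep[0])   # ['final']
```

For a flat list of strings, as in the example above, the shallow copy is enough; the distinction only shows up when the elements themselves are mutable.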