How to copy a DataFrame in pandas

I have a python script which does the following:

final_df = main_df
final_df['new_column'] = final_df['b']*0.05

Basically, I don't want to modify the main_df dataframe; I want to work on a copy of it. But when I run the above script, both final_df and main_df are changed in the same way. Why does this happen?

How do I get the behavior I want?

CodePudding user response：

Use the copy method:

final_df = main_df.copy()

CodePudding user response：

You have to deep copy it; otherwise final_df would point to the same object as main_df. So it would be:

final_df = main_df.copy()  # copy() makes a deep copy by default (deep=True)

Here is the link to the documentation.
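As a minimal sketch of the fix applied to the question's script (the column values below are made up for illustration and are not from the original post):

```python
import pandas as pd

# Hypothetical sample data; only the column name 'b' comes from the question.
main_df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# copy() returns a new DataFrame with its own data (deep=True is the default).
final_df = main_df.copy()
final_df["new_column"] = final_df["b"] * 0.05

print(list(main_df.columns))   # ['a', 'b']
print(list(final_df.columns))  # ['a', 'b', 'new_column']
```

Only final_df gains the new column; main_df keeps its original two columns.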

CodePudding user response：

The reason is that final_df = main_df does not make a copy of main_df. It only creates a new name, final_df, that refers to the same object as main_df. So if the object is changed through one name (final_df in your case), the change is also visible through main_df, because both names point to the same object.
Example:

main_df = ['final', 'df']
final_df = main_df  # plain assignment: no copy is made
print(f'Location of final_df: {id(final_df)}')
print(f'Location of main_df: {id(main_df)}')

Both of the above print statements print the same id, because both names refer to the same object.
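The same thing happens with a DataFrame. A small sketch that reproduces the behavior from the question, assuming made-up sample values for column 'b':

```python
import pandas as pd

# Hypothetical sample data for illustration.
main_df = pd.DataFrame({"b": [100, 200]})

# Plain assignment: final_df is just a second name for the same DataFrame.
final_df = main_df
final_df["new_column"] = final_df["b"] * 0.05

print(final_df is main_df)              # True
print("new_column" in main_df.columns)  # True
```

Because there is only one DataFrame object, the column added through final_df also shows up in main_df.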
Here is the nice writeup to understand this behavior.
If you don't want main_df to be affected, create a copy of it as below (DataFrame.copy() makes a deep copy by default):

final_df = main_df.copy()

With copy(), a new object is created, which can be verified with the code below:

main_df = ['final', 'df']
# list.copy() is a built-in list method, so no import is needed.
final_df = main_df.copy()
print(f'Location of final_df: {id(final_df)}')
print(f'Location of main_df: {id(main_df)}')

Now the two print statements print two different ids, because final_df is a separate object.
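One caveat about the list example: unlike DataFrame.copy(), which is deep by default, list.copy() is a shallow copy. A short sketch of the difference, using a nested list chosen purely for illustration and the standard library's copy.deepcopy:

```python
import copy

# A list of lists, so the elements themselves are mutable objects.
main = [["final"], ["df"]]

shallow = main.copy()       # new outer list, but the inner lists are shared
deep = copy.deepcopy(main)  # new outer list and new inner lists

main[0].append("changed")

print(shallow[0])  # ['final', 'changed']
print(deep[0])     # ['final']
print(shallow is main, shallow[0] is main[0])  # False True
```

For the flat list of strings used above the distinction does not matter, since strings are immutable.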