How does the copy method work in a pandas DataFrame?

When I write dg = df.copy() on a pandas DataFrame, I know I have two DataFrames, where dg is a copy of df. But with df = df.copy(), does the new df override the old df? That is, after df = df.copy(), how many DataFrames do I have in RAM?

CodePudding user response:

From the official pandas documentation:

DataFrame.copy(deep=True)

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
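The deep=True case quoted above can be checked with a short sketch. The column name and values are made up for illustration; only DataFrame.copy() and .loc come from pandas itself.

```python
import pandas as pd

# Illustrative data; the column name "a" is an arbitrary choice.
df = pd.DataFrame({"a": [1, 2, 3]})

dg = df.copy()           # deep=True is the default
dg.loc[0, "a"] = 100     # modify the copy only

original_value = int(df.loc[0, "a"])   # the original keeps its value
copied_value = int(dg.loc[0, "a"])     # the copy holds the new value
distinct_objects = dg is not df        # copy() returned a new object
```

With a deep copy, the write to dg leaves df untouched, which matches the documented behavior.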

So when you write df = df.copy(), the copy is a new object, and the name df is rebound to it. The contents are identical, so functionally nothing changes. Once the assignment completes, nothing refers to the old DataFrame any more (assuming no other references to it exist), so it can be freed and you are left with a single DataFrame in memory.

CodePudding user response:

But with df = df.copy(), does the new df override the old df? That is, after df = df.copy(), how many DataFrames do I have in RAM?

This is not a question about Pandas or the DataFrame class. It is a question about the = operator in Python.

df.copy() creates a new object, which happens to be a new instance of the DataFrame class. That's all you have to know. (You do have to know this, because functions can return objects that already existed.) It will do this exactly the same way whether you write dg = df.copy() or df = df.copy() - it could not possibly matter, because there is no way for the method to know that the assignment is even going to happen.

Assignment causes a name to refer to some particular object. That's it. dg = df.copy() means "when you get the object back from df.copy(), let dg be a name for that object". df = df.copy() means "when you get the object back from df.copy(), let df (stop being a name for what it was naming before, and) be a name for that object".
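Since this is about assignment rather than pandas, the rebinding can be sketched with a small stand-in class. Table is hypothetical and exists only for this illustration; any class with a copy() method that returns a new object would behave the same way.

```python
import copy


class Table:
    """Hypothetical stand-in for a DataFrame; only copy() matters here."""

    def __init__(self, rows):
        self.rows = rows

    def copy(self):
        # Always returns a brand-new Table object.
        return Table(copy.deepcopy(self.rows))


df = Table([1, 2, 3])
original = df            # a second name for the same object, kept for inspection

dg = df.copy()           # dg names a new object; df still names the original
same_after_dg = df is original

df = df.copy()           # df stops naming the original and names another new object
same_after_rebind = df is original
```

Here same_after_dg is True and same_after_rebind is False: the call to copy() does the same work both times, and only the name on the left-hand side differs.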

Objects persist for as long as something refers to them: a name, or any other reference.

When you write dg = df.copy(), the df name is still a name for the original DataFrame, so now you necessarily have two DataFrames in memory.

When you write df = df.copy(), the right-hand side is evaluated first, so for a moment both DataFrames exist. Then the df name is not a name for that original DataFrame any more, because it was changed to be a name for the new one. So now the old one may or may not still be in memory.

It will definitely still be in memory if it has any other names (or other references - for example, being an element of a list somewhere).

In the reference implementation (CPython), it will be freed as soon as that was the last remaining reference to the object. This happens because CPython uses reference counting as its main garbage collection mechanism. Other implementations (for example, Jython) may not do this; they may use any sort of garbage collection technique, and may free the object later.
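On the reference implementation, CPython, this can be observed with a weak reference, which watches an object without keeping it alive. The sketch below again uses a hypothetical stand-in class named Table rather than a real DataFrame; the gc.collect() calls are not needed on CPython and are there only as a nudge for other implementations.

```python
import gc
import weakref


class Table:
    """Hypothetical stand-in for a DataFrame."""

    def __init__(self, rows):
        self.rows = list(rows)

    def copy(self):
        return Table(self.rows)


# Case 1: df is the only reference to the original object.
df = Table([1, 2, 3])
watcher = weakref.ref(df)    # does not count as a reference that keeps it alive
df = df.copy()               # the original loses its last strong reference
gc.collect()
freed_without_other_refs = watcher() is None

# Case 2: a list also refers to the original object.
df = Table([4, 5, 6])
keep = [df]                  # second reference, held by the list
watcher2 = weakref.ref(df)
df = df.copy()
gc.collect()
still_alive_in_list = watcher2() is keep[0]
```

On CPython both flags end up True: the first original is freed right after the rebinding, while the second stays in memory because the list still refers to it.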