Why does pandas.DataFrame change the data source?

Time:08-12

I'm learning Python, and I found a thing I can't understand.

I created a pandas.DataFrame from an ndarray, and then modified only the DataFrame, not the ndarray.

And to my surprise, the ndarray changed too! Is the data cached inside the DataFrame? If yes, why did the ndarray change as well? If no, what about a DataFrame created without any source?

from pandas import DataFrame
import numpy as np

if __name__ == '__main__':
    nda1 = np.zeros((3,3), dtype=float)
    print(f'original nda1:\n{nda1}\n')

    df1 = DataFrame(nda1)
    print(f'original df1:\n{df1}\n')

    df1.iat[2,2] = 999
    #print(f'df1 in main:\n{df1}\n')
    print(f'nda1 after modify:\n{nda1}\n')

CodePudding user response:

Many programmers experience this. This is because of this line:

df1 = DataFrame(nda1)

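In this case the constructor wraps nda1 without copying it, so df1 and nda1 use the same memory and a write through either one shows up in the other. If you want a DataFrame that is independent of its source, use: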

df2 = df1.copy()

or

df1 = DataFrame(nda1.copy())
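
As a minimal sketch of both options, reusing the names from the question (the specific cell positions and values are only for illustration), copying the array before construction protects nda1, and df1.copy() protects df1:

```python
import numpy as np
from pandas import DataFrame

nda1 = np.zeros((3, 3), dtype=float)

# Build the DataFrame from a copy of the array, so nda1 is not the backing data.
df1 = DataFrame(nda1.copy())
df1.iat[2, 2] = 999

# df1.copy() makes a deep copy by default, so df2 is independent of df1.
df2 = df1.copy()
df2.iat[0, 0] = -1

print(nda1[2, 2])     # 0.0, the source array is untouched
print(df1.iat[0, 0])  # 0.0, df1 is untouched by the write to df2
print(df2.iat[0, 0])  # -1.0
```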

Highly relevant post:

Why can pandas dataframes change each other

CodePudding user response:

DataFrames use NumPy arrays under the hood. Because your array has a single homogeneous dtype, the DataFrame keeps the array as is instead of copying it.

You can check it (with pandas imported as pd):

pd.DataFrame(nda1).values.base is nda1
# True

You can force a copy to avoid the issue:

df1 = pd.DataFrame(nda1.copy())

or copy from within the constructor:

df1 = pd.DataFrame(nda1, copy=True)

Check that the underlying array is different:

pd.DataFrame(nda1, copy=True).values.base is nda1
# False
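
Another way to run the same check, sketched here with np.shares_memory instead of .base, is to ask NumPy directly whether the two buffers overlap. The variable names below are illustrative. The result for the plain constructor is only printed, not stated, because it can depend on the pandas version and on whether copy-on-write is enabled:

```python
import numpy as np
import pandas as pd

nda1 = np.zeros((3, 3), dtype=float)

# Explicit copies: neither DataFrame should overlap with nda1 in memory.
df_ctor_copy = pd.DataFrame(nda1, copy=True)
df_array_copy = pd.DataFrame(nda1.copy())

ctor_copy_shared = np.shares_memory(df_ctor_copy.to_numpy(), nda1)
array_copy_shared = np.shares_memory(df_array_copy.to_numpy(), nda1)
print(ctor_copy_shared, array_copy_shared)  # False False

# Plain constructor: may or may not share memory, depending on the pandas setup.
df_plain = pd.DataFrame(nda1)
print(np.shares_memory(df_plain.to_numpy(), nda1))
```

If the last line prints True on your installation, writes to df_plain can reach nda1, which is the behaviour described in the question.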