Issue adding new column within def local dataframe

I found some strange behaviour: a dataframe variable that is local to a function (def) affects the external dataframe when I add a new column to the local one. I wonder whether this can be prevented with some sort of setting.

Here is case 1, where column C ends up in the external dataframe after being added to the local dataframe inside test():

import pandas as pd

data = pd.DataFrame({'A' : [1, 2, 3, 4],
                     'B' : [1,2,3,4]})

def test(data):
    # These 2 lines below are the only difference 
    # between case 1 & 2
    #data = data + 1
    #data = data - 1

    data['C'] = data['A'] + data['B']
    return data['C']

print("Before running test()")
print(data)
print("Returns from test()")
print(test(data))
print("data after running test()")
print(data)

Before running test()
   A  B
0  1  1
1  2  2
2  3  3
3  4  4
Returns from test()
0    2
1    4
2    6
3    8
Name: C, dtype: int64
data after running test()
   A  B  C
0  1  1  2
1  2  2  4
2  3  3  6
3  4  4  8

In case 1, I thought the data inside def test(data): would be local, so that adding a new column would not affect the source dataframe outside the function. But the new column C is still added to the dataframe outside test().

However, if I modify the local data first and then add the new column, the external dataframe is not affected, as in the following case 2:

import pandas as pd

data = pd.DataFrame({'A' : [1, 2, 3, 4],
                     'B' : [1,2,3,4]})

def test(data):
    # These 2 lines below are the only difference 
    # between case 1 & 2
    data = data + 1
    data = data - 1

    data['C'] = data['A'] + data['B']
    return data['C']

print("Before running test()")
print(data)
print("Returns from test()")
print(test(data))
print("data after running test()")
print(data)

Before running test()
   A  B
0  1  1
1  2  2
2  3  3
3  4  4
Returns from test()
0    2
1    4
2    6
3    8
Name: C, dtype: int64
data after running test()
   A  B
0  1  1
1  2  2
2  3  3
3  4  4

I can set up a temporary dataframe to avoid the problem, but that is rather cumbersome. I would very much like to force the dataframe to be local, so that when I make changes or add a new column, the source / external dataframe is not affected.

CodePudding user response：

The data = data + 1 assignment creates a new object, which you can see if you look at id(data) inside the function (test2 below). Without that assignment, the function just modifies the same object in place (test1 below):

In [11]: def test1(data):
    ...:     print(id(data))
    ...:     data['C'] = data['A'] + data['B']
    ...:     return data['C']

In [12]: data, id(data)
Out[12]:
(   A  B
 0  1  1
 1  2  2
 2  3  3
 3  4  4, 1413911641896)

In [13]: test1(data)
1413911641896
Out[13]:
0    2
1    4
2    6
3    8
Name: C, dtype: int64

In [14]: def test2(data):
    ...:     data = data + 1
    ...:     print(id(data))
    ...:     data['D'] = data['A'] + data['B']
    ...:     return data['D']

In [15]: test2(data)
1413912402128
Out[15]:
0     4
1     6
2     8
3    10
Name: D, dtype: int64
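
The same check can be written as a plain script using `is` instead of comparing id() values by eye. This is only a sketch: the function names adds_in_place and rebinds_first are made up for illustration and are not from the question.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': [1, 2, 3, 4]})

def adds_in_place(df):
    # df refers to the caller's dataframe, so this adds C to it.
    df['C'] = df['A'] + df['B']
    return df is data

def rebinds_first(df):
    # df + 1 builds a new dataframe; the local name df now refers to that one.
    df = df + 1
    df['D'] = df['A'] + df['B']
    return df is data

same_object = adds_in_place(data)
rebound_same = rebinds_first(data)

print(same_object, rebound_same)
print(list(data.columns))
```

Column C shows up in the outer dataframe because the first function never rebinds its argument; column D does not, because it was added to the new object created by df + 1.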

CodePudding user response：

It seems that in the second case when you add the line

data = data + 1

you're creating a new instance of the dataframe and modifying that instead of the original. In the first case the function works on the original object: you can't add a column to a dataframe inside a function without making a copy of it first, or else you modify the initial df as well.