Series.astype creates a copy when copy=False but dtype is different

Time:08-27

I encountered the following behaviour:

import pandas as pd

s1 = pd.Series([1, 2])
s2 = s1.astype('int64', copy=False)
s2[0] = 10
s1  # note that s1[0] has changed too

0    10
1     2
dtype: int64

whereas after changing one value to a decimal:

s1 = pd.Series([1, 2.0])
s2 = s1.astype('int64', copy=False)
s2[0] = 10
s1  # note that s1[0] is expected to change, but does not

0    1.0
1    2.0
dtype: float64

Just by changing one value to a decimal, Python seems to ignore the assignment and the type conversion!

Did I spend months learning this unreliable stuff?!

CodePudding user response:

This is not due to Python's behaviour; it is how pandas works.

From the pandas documentation:

be very careful setting copy=False as changes to values then may propagate to other pandas objects

(Emphasis mine.)

The difference is due to these lines:

if is_dtype_equal(self.dtype, dtype):
    # Ensure that self.astype(self.dtype) is self
    return self.copy() if copy else self

It's clear that the intended behaviour is for astype to re-use the underlying data rather than make a copy when the dtype is unchanged. With a different dtype, pandas can't always just re-use the data, because the data itself is different: an int64 with the value 2 does not have the same binary representation as a float64 with the value 2.0.
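As a rough illustration of that point, here is a small sketch using plain NumPy arrays rather than pandas internals. It assumes NumPy's own astype is a fair stand-in for what happens underneath a Series, and the variable names are made up for the example.

```python
import numpy as np

a = np.array([1, 2], dtype=np.int64)
f = np.array([1.0, 2.0], dtype=np.float64)

# Same value, different bytes: an int64 buffer cannot double as a float64 buffer.
print(np.int64(2).tobytes() == np.float64(2.0).tobytes())  # False

# Same dtype with copy=False: NumPy hands back the very same array object.
same = a.astype(np.int64, copy=False)
print(same is a)  # True

# Different dtype: a new buffer has to be allocated, even with copy=False.
converted = f.astype(np.int64, copy=False)
print(np.shares_memory(f, converted))  # False
```

So copy=False is best read as "don't copy unless you have to", not as a guarantee that no copy is made.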

CodePudding user response:

The difference between the two cases is that, with copy=False, .astype() has nothing to convert when the Series already has the requested dtype, so it can return the existing data.

When you do this:

s1 = pd.Series([1, 2])
s2 = s1.astype('int64', copy=False)

you now have two Series objects with the same underlying NumPy array, as you can see here:

>>> s1array = s1.array.to_numpy()
>>> s2array = s2.array.to_numpy()
>>> s1array is s2array, id(s1array), id(s2array)
(True, 123145143806352, 123145143806352)

Since s1 and s2 are backed by the same NumPy array, changing one changes the other.

But when you do this:

s1 = pd.Series([1, 2.0])
s2 = s1.astype('int64', copy=False)

s1 now has dtype float64:

>>> s1.dtype
dtype('float64')

so converting its type to int64 creates a new array:

>>> s1array = s1.array.to_numpy()
>>> s2array = s2.array.to_numpy()
>>> s1array is s2array, id(s1array), id(s2array)
(False, 123145146671920, 123145146285680)

and therefore modifying one of them doesn't modify the other.
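If you would rather not compare id() values by hand, a check along these lines may be more convenient. This is only a sketch: shares_buffer is a hypothetical helper name, not a pandas function, and it assumes np.shares_memory on the result of to_numpy() is a good enough test for plain numeric Series. Newer pandas versions with Copy-on-Write enabled may also behave differently from the same-dtype case shown above, so it is safer not to rely on changes propagating.

```python
import numpy as np
import pandas as pd


def shares_buffer(left: pd.Series, right: pd.Series) -> bool:
    """Return True if the two Series are backed by overlapping memory."""
    # shares_buffer is a made-up helper for this example, not part of pandas.
    return np.shares_memory(left.to_numpy(), right.to_numpy())


floats = pd.Series([1, 2.0])                # dtype float64
ints = floats.astype("int64", copy=False)   # conversion needed, so new data
print(shares_buffer(floats, ints))          # False

# When an independent object is wanted regardless of dtype, copy explicitly.
independent = floats.astype("float64").copy()
independent[0] = 10
print(floats[0])                            # 1.0
```

Asking for the copy explicitly makes the intent obvious and avoids depending on whether astype happened to need a conversion.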
