I encountered the following behaviour:
import pandas as pd
s1 = pd.Series([1, 2])
s2 = s1.astype('int64', copy=False)
s2[0] = 10
s1 # note that s1[0] has changed too
0 10
1 2
dtype: int64
whereas after changing one value to a float:
s1 = pd.Series([1, 2.0])
s2 = s1.astype('int64', copy=False)
s2[0] = 10
s1 # note that s1[0] is expected to change, but does not
0 1.0
1 2.0
dtype: float64
Just by changing one value to a float, Python ignores both the assignment and the type conversion!
Did I spend months learning this unreliable stuff?!
Answer:
This is not Python's behaviour; it is pandas' behaviour.
From the pandas documentation:
"be very careful setting copy=False as changes to values then may propagate to other pandas objects"
(Emphasis mine.)
The difference is due to these lines:
if is_dtype_equal(self.dtype, dtype):
    # Ensure that self.astype(self.dtype) is self
    return self.copy() if copy else self
It's clear that the intended behaviour of astype here is to re-use the underlying data rather than make a copy. With a different dtype, pandas can't always just re-use the data (because it's different: an int64 with the value 2 does not have the same binary representation as a float64 with the value 2.0).
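To make the point about binary representation concrete, here is a small sketch using only the standard library. It packs the value 2 as a little-endian 64-bit integer and as a 64-bit float; the variable names are just for illustration.

```python
import struct

# Pack the same numeric value as a little-endian int64 and float64.
int_bytes = struct.pack("<q", 2)
float_bytes = struct.pack("<d", 2.0)

print(int_bytes.hex())           # 0200000000000000
print(float_bytes.hex())         # 0000000000000040
print(int_bytes == float_bytes)  # False
```

Because the eight bytes differ, a float64 buffer cannot simply be relabelled as int64; a conversion has to write new data somewhere.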
Answer:
The difference between the two cases is that, with copy=False, .astype() doesn't need to convert anything when the Series already has the requested dtype.
When you do this:
s1 = pd.Series([1, 2])
s2 = s1.astype('int64', copy=False)
you now have two Series objects with the same underlying NumPy array, as you can see here:
>>> s1array = s1.array.to_numpy()
>>> s2array = s2.array.to_numpy()
>>> s1array is s2array, id(s1array), id(s2array)
(True, 123145143806352, 123145143806352)
Since s1 and s2 refer to the same NumPy array, changing one changes the other.
But when you do this:
s1 = pd.Series([1, 2.0])
s2 = s1.astype('int64', copy=False)
s1 now has dtype float64:
>>> s1.dtype
dtype('float64')
so converting its type to int64 creates a new array:
>>> s1array = s1.array.to_numpy()
>>> s2array = s2.array.to_numpy()
>>> s1array is s2array, id(s1array), id(s2array)
(False, 123145146671920, 123145146285680)
and therefore modifying one of them doesn't modify the other.
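If you would rather not compare id() values by hand, a rough alternative is to ask NumPy whether two Series are backed by overlapping memory. This is only a sketch: shares_data is a hypothetical helper name, and the result for the same-dtype case may depend on your pandas version (newer releases move towards Copy-on-Write and may warn about the copy keyword).

```python
import numpy as np
import pandas as pd


def shares_data(a: pd.Series, b: pd.Series) -> bool:
    """Return True if the two Series are backed by overlapping memory."""
    return bool(np.shares_memory(a.to_numpy(), b.to_numpy()))


ints = pd.Series([1, 2])      # dtype int64
floats = pd.Series([1, 2.0])  # dtype float64

# Same dtype: no conversion is needed, so the data can be re-used.
same_dtype = ints.astype("int64", copy=False)
# Different dtype: the values must be converted into a new buffer.
converted = floats.astype("int64", copy=False)

print(shares_data(ints, same_dtype))   # typically True; version dependent
print(shares_data(floats, converted))  # False
```

If you need s2 to be independent of s1 regardless of dtype, leaving copy at its default or calling .copy() explicitly is the safer choice.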