Context
I have a method that takes a pandas Series of categorical data and returns an indexed version of it. However, I think my implementation also modifies the given Series instead of only returning a new, modified Series. I also get the following warnings:
A value is trying to be set on a copy of a slice from a DataFrame. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  series[series == value] = index
SettingWithCopyWarning: modifications to a property of a datetimelike object are not supported and are discarded. Change values on the original.
  cacher_needs_updating = self._check_is_chained_assignment_possible()
Code
import pandas

def categorials(series: pandas.Series) -> pandas.Series:
    unique = series.unique()
    for index, value in enumerate(unique):
        series[series == value] = index
    return series.astype(pandas.Int64Dtype())
Question
- How can I make this method return the modified series without changing the original series that was passed in?
Answer
You need to .copy() the incoming argument. Normally that warning would not appear; we are free to write to a Series or DataFrame, after all. However, in the code you did not share, the argument you pass in was apparently obtained as a subset of another Series or DataFrame (or perhaps of itself). FYI, if you plan to modify a subset, it is better to chain .copy() onto the end of the expression that creates it.
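To illustrate the advice about subsets, here is a minimal sketch. The DataFrame, its column names and its values are made up for the example; they are not from the question. It only shows that a subset created with .copy() chained on can be modified without touching the frame it came from.

```python
import pandas as pd

# Hypothetical example data; names and values are illustrative only.
df = pd.DataFrame(
    {"fruit": ["apple", "pear", "apple", "kiwi"], "qty": [3, 0, 5, 2]}
)

# Take the subset and copy it right away, so it is an independent object.
subset = df.loc[df["qty"] > 0, "fruit"].copy()

# Writing to the copy leaves df untouched.
subset[subset == "apple"] = "APPLE"

print(df["fruit"].tolist())  # original values are still lowercase
print(subset.tolist())       # only the copy was changed
```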
Anyway, back to the question: series = series.copy() as the first line of the function should resolve the issue. However, your method is actually doing factorization, so
pd.Series(pd.factorize(series)[0], index=series.index)
is essentially equivalent to what your function does (apart from the result having a plain NumPy integer dtype rather than the nullable Int64). Since pd.factorize returns a 2-tuple of (codes, uniques), we take element 0. It returns a NumPy array, so we wrap it in a Series with the incoming index. Note that it does not attempt to modify the original Series, so no .copy() is needed for it.
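Both suggestions can be compared side by side in a short sketch. The function names categorials_copy and categorials_factorize and the sample data are assumptions made for this illustration, and pandas is imported as pd here rather than as pandas. The sample Series is given object dtype explicitly so that writing integers into a Series of strings is allowed.

```python
import pandas as pd


def categorials_copy(series: pd.Series) -> pd.Series:
    # Work on a copy so the caller's Series is left untouched.
    result = series.copy()
    for index, value in enumerate(result.unique()):
        result[result == value] = index
    return result.astype(pd.Int64Dtype())


def categorials_factorize(series: pd.Series) -> pd.Series:
    # pd.factorize returns (codes, uniques); only the codes are needed.
    codes, _uniques = pd.factorize(series)
    return pd.Series(codes, index=series.index)


# Hypothetical sample data with a non-default index.
original = pd.Series(["b", "a", "b", "c"], index=[10, 11, 12, 13], dtype=object)

via_copy = categorials_copy(original)
via_factorize = categorials_factorize(original)

print(original.tolist())       # the input is unchanged
print(via_copy.tolist())       # codes in order of first appearance
print(via_factorize.tolist())  # same codes, same index
```

One difference to be aware of: by default pd.factorize encodes missing values as -1, whereas the loop version never matches NaN with ==, so the two can differ on data that contains missing values.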