Let's say we have a series of ordered categorical data:
s = pd.Series(['zero','one','two','three','four','five','six'], dtype='category')
s = s.cat.reorder_categories(['zero','one','two','three','four','five','six'], ordered=True)
print(s)
Here's an output of the code above:
0 zero
1 one
2 two
3 three
4 four
5 five
6 six
dtype: category
Categories (7, object): ['zero' < 'one' < 'two' < 'three' < 'four' < 'five' < 'six']
If I apply clip
with only one of lower=
or upper=
parameter, I get the expected output. But if both of them are passed then the output is somewhat strange:
df = pd.concat([s, s.clip(lower='two'), s.clip(upper='four'), s.clip(lower='two', upper='four')], axis=1)
print(df)
Output, note the last column:
0 1 2 3
0 zero two zero four
1 one two one four
2 two two two four
3 three three three two
4 four four four two
5 five five four two
6 six six four two
I expected to see in the last column a series ['two','two','two','three','four','four',four']
, not the one in the output above.
Also I noticed that replacing of the values for lower=..., upper=...
doesn't change the result. See:
pd.concat([s.clip(lower='two',upper='four'), s.clip(lower='four',upper='two')], axis=1)
with output:
0 1
0 four four
1 four four
2 four four
3 two two
4 two two
5 two two
6 two two
Why clipping of the ordered categorical data behaves like that?
Pandas 1.5.1
Numpy 1.23.4
Python 3.10.7
CodePudding user response:
This looks like a bug. In the source there is a check that the upper/lower bounds are in the correct order. However I think this is mistakenly using the string values of the categories ('four' < 'two'
):
8095 # GH 2747 (arguments were reversed)
8096 if (
8097 lower is not None
8098 and upper is not None
8099 and is_scalar(lower)
8100 and is_scalar(upper)
8101 ):
8102 lower, upper = min(lower, upper), max(lower, upper) # <- HERE
Indeed using zfour
in place of four
:
s = pd.Series(['zero','one','two','three','zfour','five','six'], dtype='category')
s = s.cat.reorder_categories(['zero','one','two','three','zfour','five','six'], ordered=True)
s.clip(lower='two', upper='zfour')
output:
0 two
1 two
2 two
3 three
4 zfour
5 zfour
6 zfour
dtype: category
Categories (7, object): ['zero' < 'one' < 'two' < 'three' < 'zfour' < 'five' < 'six']