A Lexicographical Bug in Pandas?

Time:11-11

Please take this question lightly as asked from curiosity:

While trying to understand how slicing works on a MultiIndex, I came across the following situation:

import numpy as np
import pandas as pd

# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)

Returns:

a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

Note that the first-level labels are not in sorted order (the order is a, c, b), which is what should produce the expected error when slicing.

# When we do slicing
data.loc["a":"c"]

Errors like:

UnsortedIndexError

----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
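For reference, the failure can be reproduced and caught explicitly. This is a minimal sketch, assuming a pandas version where UnsortedIndexError is importable from pandas.errors; it uses np.arange instead of random values so the output is repeatable, and the variable names raised and sorted_slice are illustrative only.

```python
import numpy as np
import pandas as pd
from pandas.errors import UnsortedIndexError

# Same index as in the question, with deterministic values.
index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series(np.arange(6), index=index)

try:
    data.loc["a":"c"]
    raised = False
except UnsortedIndexError as exc:
    # Label slicing on a MultiIndex requires the index to be lexsorted.
    raised = True
    print(type(exc).__name__, exc)

# Sorting the index first makes the same slice valid.
sorted_slice = data.sort_index().loc["a":"c"]
print(sorted_slice)
```

Note that after sorting, the slice from "a" to "c" also includes the "b" rows, because "b" now sits between them.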

That's expected. But now, after doing the following steps:

# Making a DataFrame
data = data.unstack()

# Reindexing - to unsort the indices like before
data = data.reindex(["a", "c", "b"])

# Which looks like 
   1  2
a  5  0
c  8  6
b  6  3

# Then again making series
data = data.stack()

# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)


# Which looks like before
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32

The Problem

So, now the process is: Series → Unstack → DataFrame → Stack → Series

Now, if I do the same slicing as before (with the indices still unsorted), I don't get any error!

# The same slicing
data.loc["a":"c"]

Results without an error:

a  1    5
   2    0
c  1    8
   2    6
dtype: int32

Even though data.index.is_monotonic is False, why can we still slice?

So the question is: why?

I hope the situation is clear: the same Series that raised the error before no longer raises it after the unstack and stack operations.

So is that a bug, or a new concept that I am missing here?

Thanks!
Aayush ∞ Shah

UPDATE: I have added data.reindex() to unsort the index once more. Please have another look.

CodePudding user response:

The difference between your two Series is the following:

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

data = pd.Series(np.random.randint(10, size=6), index=index)

data2 = data.unstack().reindex(["a", "c", "b"]).stack()

>>> data.index.codes
FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])

>>> data2.index.codes
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

Even though your two indexes look the same (same values), their internal representations (the codes) are different.

Check this method of MultiIndex:

        Create a new MultiIndex from the current to monotonically sorted
        items IN the levels. This does not actually make the entire MultiIndex
        monotonic, JUST the levels.

        The resulting MultiIndex will have the same outward
        appearance, meaning the same .values and ordering. It will also
        be .equals() to the original.
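To make the difference concrete, the two indexes can be rebuilt by hand from the codes shown above. This is only a sketch: the level ordering for idx2 (['a', 'c', 'b']) is an assumption inferred from those codes and the displayed values, not something printed in the answer, and the names idx1, idx2, and s2 are illustrative.

```python
import pandas as pd

# Layout produced by from_product: levels sorted, codes out of order.
idx1 = pd.MultiIndex(
    levels=[["a", "b", "c"], [1, 2]],
    codes=[[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]],
)

# Assumed layout after unstack/reindex/stack: levels kept in order of
# appearance (a, c, b), so the codes come out ascending.
idx2 = pd.MultiIndex(
    levels=[["a", "c", "b"], [1, 2]],
    codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
)

# Same outward appearance: identical tuples in identical order.
same_labels = idx1.equals(idx2)
print(same_labels, list(idx1) == list(idx2))

# The labels themselves are not in sorted order in either index.
print(idx1.is_monotonic_increasing, idx2.is_monotonic_increasing)

# Slicing idx2 works because its codes are sorted; the slice follows the
# level order a, c, b, so "a":"c" stops before the "b" rows.
s2 = pd.Series(range(6), index=idx2)
sliced = s2.loc["a":"c"]
print(sliced)
```

Under these assumptions, slicing depends on whether the codes are sorted rather than on whether the labels are, which would explain why the second Series can be sliced while the first cannot.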

Old answer

# Making a DataFrame
data = data.unstack()

# Which looks like         # <- WRONG
   1  2                    #    1  2
a  5  0                    # a  8  0
c  8  6                    # b  4  1
b  6  3                    # c  7  6

# Then again making series
data = data.stack()

# Which looks like before  # <- WRONG
a  1    5                  # a  1    2
   2    0                  #    2    1
c  1    8                  # b  1    0
   2    6                  #    2    1
b  1    6                  # c  1    3
   2    3                  #    2    9
dtype: int32

If you want to use slicing, first check whether the index is monotonic (is_monotonic was removed in pandas 2.0; use is_monotonic_increasing there):

# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)

>>> data.index.is_monotonic
False

>>> data.unstack().stack().index.is_monotonic
True

>>> data.sort_index().index.is_monotonic
True
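The check above can be folded into a small helper that sorts only when needed before slicing. This is a sketch under stated assumptions: slice_first_level is a hypothetical name, not a pandas function, and it uses is_monotonic_increasing so that it also runs on pandas versions where is_monotonic is no longer available.

```python
import numpy as np
import pandas as pd


def slice_first_level(series, start, stop):
    """Hypothetical helper: sort the index if needed, then slice by label."""
    if not series.index.is_monotonic_increasing:
        # sort_index() returns a new Series with a lexsorted index.
        series = series.sort_index()
    return series.loc[start:stop]


index = pd.MultiIndex.from_product([["a", "c", "b"], [1, 2]])
data = pd.Series(np.arange(6), index=index)

result = slice_first_level(data, "a", "b")
print(result)
```

Sorting changes the row order, so the result follows the sorted labels (a, then b) rather than the original a, c, b order.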