Home > other >  Accessing MultiIndex in a specific format
Accessing MultiIndex in a specific format

Time:07-06

I have the following dataframe,

df = pd.DataFrame(np.random.randint(0,1000, (5,6)), \
                  columns = pd.MultiIndex.from_product([['CPC', 'Conversions'], ['April', 'June', 'May']])).rename_axis(index = {None : 'idx'})

df

      CPC                 Conversions          
      April June  May     April June  May
idx                                      
 0     663  964  971      663   76  927
 1     405  217  754      370  306   34
 2     474  229  664      354   66  885
 3      73  538  139      417  876  855
 4     619  618  618      455  134  805

I wanted to access the months column level in right format so I did this

df.loc[:, (slice(None), ['April', 'May', 'June'])]

        CPC Conversions   CPC Conversions   CPC Conversions
      April       April   May         May  June        June
idx                                                     
 0      806         202   963         975   110          55
 1      263         884   442         563   216         694
 2      462         361   780         412   858         670
 3      742         756   525          33   477         826
 4      579         332    91         802   829         231

It's weird how the result comes while it works perfectly fine when I use two months instead of three (see below)

df.loc[:, (slice(None), ['April', 'May'])]

              CPC      Conversions     
       April  May       April  May
idx                            
 0       856  619         180  593
 1        64  403         929   80
 2       973  285         803  967
 3       769  405         701  267
 4       940  368         863  717

Can anyone tell me what's wrong, it seems buggy to me.

Currently what I'm doing is creating a MultiIndex.from_product (just as I used for creating this example) with ordered months and replace the existing column with it. However, I don't want to do that since the other column levels might change over time.

CodePudding user response:

It's not buggy, it's just unexpected:

>>> df.columns.levels
FrozenList([['CPC', 'Conversions'], ['April', 'June', 'May']])

>>> df.columns.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

>>> df.columns.is_monotonic
True

We can see 'April' has the code 0, 'June' the code 1 and 'May' the code 2.

Now you slice your dataframe by second level:

# Your code (unordered codes)
>>> df.loc[:, (slice(None), ['April', 'May', 'June'])]

  CPC Conversions  CPC Conversions  CPC Conversions
April       April  May         May June        June
...

but if your slice your dataframe according the ordered codes, it works as expected:

>>> df.loc[:, (slice(None), ['April', 'June', 'May'])]
  CPC           Conversions          
April June  May       April June  May
...

Check some properties:

>>> df.loc[:, (slice(None), ['April', 'May', 'June'])].columns.is_monotonic
False

>>> df.loc[:, (slice(None), ['April', 'June', 'May'])].columns.is_monotonic
True

>>> df.loc[:, (slice(None), ['April', 'May', 'June'])].columns.codes
FrozenList([[0, 1, 0, 1, 0, 1], [0, 0, 2, 2, 1, 1]])

>>> df.loc[:, (slice(None), ['April', 'June', 'May'])].columns.codes
FrozenList([[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

If you use sort_index after your slice, the dataframe is sorted as its creation:

>>> df.loc[:, (slice(None), ['April', 'May', 'June'])].sort_index(axis=1)
  CPC           Conversions          
April June  May       April June  May
...

Sorting MultiIndex is based on codes.

CodePudding user response:

You can order your columns by using df.reindex():

cols = ['April', 'May', 'June']
new_cols = df.columns.reindex(cols, level=1)
df.reindex(columns=new_cols[0])

Without creating new variables:

df.reindex(columns=['April', 'May', 'June'], level=1)
  • Related