Multiindexed columns with missing levels in Pandas

I'm trying to use a dataframe with multiindexed columns, but I don't quite get how to handle columns with missing levels. As an MWE, I create a dummy dataframe like this:

In [4]: df = pd.DataFrame(
   ...:     np.random.rand(2, 4),
   ...:     columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
   ...: )
   ...: df[('C', np.nan)] = 3
   ...: df
Out[4]: 
          A                   B             C
        1.0       2.0       1.0       2.0 NaN
0  0.484498  0.218928  0.480720  0.530619   3
1  0.789152  0.687612  0.953487  0.926798   3

Notice the assignment df[('C', np.nan)] = 3, which creates a new column with a single level and a padding NaN (I do this because that's what pd.MultiIndex.from_tuples does when passed tuples of different lengths).
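
A minimal sketch of that padding behaviour, assuming a reasonably recent pandas. The tuples are illustrative only and are not taken from the frame above:

```python
import pandas as pd

# Tuples of different lengths: ('C',) has no second level.
ragged = [('A', 1), ('A', 2), ('C',)]
mi = pd.MultiIndex.from_tuples(ragged)

# The short tuple is padded, so the index still has two levels
# and the missing entry is reported as NaN.
print(mi.nlevels)
print(pd.isna(mi.get_level_values(1)))
```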

I can access the first column with no problems,

In [5]: df[('A', 1)]
Out[5]: 
0    0.484498
1    0.789152
Name: (A, 1.0), dtype: float64

but when I try to access the last column with the exact key it was created with, df[('C', np.nan)], I get a KeyError:

In [6]: df[('C', np.nan)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 df[('C', np.nan)]

File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/core/frame.py:3504, in DataFrame.__getitem__(self, key)
   3502 if is_single_key:
   3503     if self.columns.nlevels > 1:
-> 3504         return self._getitem_multilevel(key)
   3505     indexer = self.columns.get_loc(key)
   3506     if is_integer(indexer):

File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/core/frame.py:3555, in DataFrame._getitem_multilevel(self, key)
   3553 def _getitem_multilevel(self, key):
   3554     # self.columns is a MultiIndex
-> 3555     loc = self.columns.get_loc(key)
   3556     if isinstance(loc, (slice, np.ndarray)):
   3557         new_columns = self.columns[loc]

File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/core/indexes/multi.py:2882, in MultiIndex.get_loc(self, key, method)
   2880 if keylen == self.nlevels and self.is_unique:
   2881     try:
-> 2882         return self._engine.get_loc(key)
   2883     except TypeError:
   2884         # e.g. test_partial_slicing_with_multiindex partial string slicing
   2885         loc, _ = self.get_loc_level(key, list(range(self.nlevels)))

File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/_libs/index.pyx:779, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc()

File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:1832, in pandas._libs.hashtable.UInt64HashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:1841, in pandas._libs.hashtable.UInt64HashTable.get_item()

KeyError: 12

Any clue?

Cheers.

CodePudding user response：

One hack is to replace the missing values in the column labels with empty strings:

print (df.rename(columns= lambda x: '' if pd.isna(x) else x)[('C','')])
0    3
1    3
Name: (C, ), dtype: int64
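
Since rename returns a new frame rather than modifying df, it may be convenient to keep the result. A small sketch using the frame from the question; the names df_clean and c_values are just placeholders I made up:

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame(
    np.random.rand(2, 4),
    columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]]),
)
df[('C', np.nan)] = 3

# rename returns a new frame, so store it; the lambda is applied
# to every label in every level and only touches the missing ones.
df_clean = df.rename(columns=lambda x: '' if pd.isna(x) else x)

# The column is now reachable with a plain tuple key.
c_values = df_clean[('C', '')]
print(c_values)
```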

CodePudding user response：

You can use just df['C']. This selects everything under C, which here is a dataframe with a single column labelled NaN:

   NaN
0    3
1    3

Or df['C'][np.nan] to get just the values:

0    3
1    3
Name: nan, dtype: int64

(NB: This only works if there is only one NaN column under C in the multi-index, otherwise you get ValueError: cannot handle a non-unique multi-index!)
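
If you would rather not depend on the NaN label being unique, another option is to select by a boolean mask over the column levels instead of by key. This is only a sketch of that idea, not something from the answers above, and the names mask and selected are placeholders:

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame(
    np.random.rand(2, 4),
    columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]]),
)
df[('C', np.nan)] = 3

# True where the first level is 'C' and the second level is missing.
first = df.columns.get_level_values(0)
second = df.columns.get_level_values(1)
mask = (first == 'C') & pd.isna(second)

# Boolean selection returns a dataframe with every matching column.
selected = df.loc[:, mask]
print(selected.iloc[:, 0])
```

The mask avoids looking up NaN as a key altogether, so it returns all matching columns if there happen to be several.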

CodePudding user response：

Not exactly what I was looking for, but I'm finally going with the solution posted here. Essentially, replacing the NaNs with empty strings makes the columns easier to handle. At dataframe creation,

df[('C', '')] = 3

and then

In [7]: df[('C', '')]
Out[7]: 
0    3
1    3
Name: (C, ), dtype: int64
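
If the columns come from a list of tuples of different lengths, the same idea can be applied before building the index. A rough sketch, where pad_tuples is a hypothetical helper (not part of pandas) and the numbers are made-up placeholder data:

```python
import pandas as pd

def pad_tuples(tuples, fill=''):
    """Pad shorter tuples with `fill` so that all have the same length."""
    width = max(len(t) for t in tuples)
    return [tuple(t) + (fill,) * (width - len(t)) for t in tuples]

# ('C',) is missing its second level.
raw = [('A', 1), ('A', 2), ('B', 1), ('B', 2), ('C',)]
columns = pd.MultiIndex.from_tuples(pad_tuples(raw))

# Placeholder values, only there to have something to select.
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4, 3],
     [0.5, 0.6, 0.7, 0.8, 3]],
    columns=columns,
)
print(df[('C', '')])
```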