I'm trying to use a dataframe with MultiIndex columns, but I don't quite get how to handle columns with missing levels. As an MWE, I create a dummy dataframe like this:
In [4]: df = pd.DataFrame(
...: np.random.rand(2, 4),
...: columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
...: )
...: df[('C', np.nan)] = 3
...: df
Out[4]:
A B C
1.0 2.0 1.0 2.0 NaN
0 0.484498 0.218928 0.480720 0.530619 3
1 0.789152 0.687612 0.953487 0.926798 3
Notice the last line of the code, which creates a new column with a single level and a padding NaN (I do this because that's what pd.MultiIndex.from_tuples does when passed tuples of different lengths).
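For reference, the padding behaviour mentioned above can be checked with a short sketch. It assumes the tuples are passed as a plain list, and the labels used here are only illustrative:

```python
import pandas as pd

# Tuples of different lengths: the last one has no second level.
mi = pd.MultiIndex.from_tuples([("A", 1), ("B", 2), ("C",)])

# The shorter tuple is padded, so the index still has two levels
# and the missing second-level label shows up as NaN.
print(mi.nlevels)
print(mi.get_level_values(1))
```

The missing label is stored as a missing value rather than as a regular entry, which is part of why looking it up by key behaves differently from the other columns.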
I can access the first column with no problems,
In [5]: df[('A', 1)]
Out[5]:
0 0.484498
1 0.789152
Name: (A, 1.0), dtype: float64
but when I try to access the last column in exactly the same way it was created, df[('C', np.nan)], I get a KeyError:
In [6]: df[('C', np.nan)]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 df[('C', np.nan)]
File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/core/frame.py:3504, in DataFrame.__getitem__(self, key)
3502 if is_single_key:
3503 if self.columns.nlevels > 1:
-> 3504 return self._getitem_multilevel(key)
3505 indexer = self.columns.get_loc(key)
3506 if is_integer(indexer):
File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/core/frame.py:3555, in DataFrame._getitem_multilevel(self, key)
3553 def _getitem_multilevel(self, key):
3554 # self.columns is a MultiIndex
-> 3555 loc = self.columns.get_loc(key)
3556 if isinstance(loc, (slice, np.ndarray)):
3557 new_columns = self.columns[loc]
File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/core/indexes/multi.py:2882, in MultiIndex.get_loc(self, key, method)
2880 if keylen == self.nlevels and self.is_unique:
2881 try:
-> 2882 return self._engine.get_loc(key)
2883 except TypeError:
2884 # e.g. test_partial_slicing_with_multiindex partial string slicing
2885 loc, _ = self.get_loc_level(key, list(range(self.nlevels)))
File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/_libs/index.pyx:779, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc()
File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()
File ~/anaconda3/envs/dlsproc/lib/python3.10/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:1832, in pandas._libs.hashtable.UInt64HashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:1841, in pandas._libs.hashtable.UInt64HashTable.get_item()
KeyError: 12
Any clue?
Cheers.
CodePudding user response:
One hack is to replace the missing values in the columns with an empty string:
print (df.rename(columns= lambda x: '' if pd.isna(x) else x)[('C','')])
0 3
1 3
Name: (C, ), dtype: int64
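If the NaN labels are already in the frame, the same idea can be applied once to the columns instead of on every lookup. This is a minimal sketch, assuming a hypothetical helper named blank_missing_levels (it is not part of pandas); the frame is rebuilt here with fixed numbers so the example runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one in the question, with a NaN sub-level under "C".
columns = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B", "C"], [1, 2, 1, 2, np.nan]]
)
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4, 3], [0.5, 0.6, 0.7, 0.8, 3]],
    columns=columns,
)


def blank_missing_levels(frame):
    """Return a copy whose column labels use '' in place of missing levels."""
    cleaned = [
        tuple("" if pd.isna(label) else label for label in col)
        for col in frame.columns
    ]
    out = frame.copy()
    out.columns = pd.MultiIndex.from_tuples(cleaned, names=frame.columns.names)
    return out


clean = blank_missing_levels(df)
print(clean[("C", "")])
```

After this, the column can be addressed with the ordinary tuple key ('C', ''), and the original frame is left untouched.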
CodePudding user response:
You can use just df['C'], which selects everything under C as a one-column DataFrame:
NaN
0 3
1 3
Or df['C'][np.nan] to get just the values:
0 3
1 3
Name: nan, dtype: int64
(NB: This only works if there is only one NaN column under C in the multi-index; otherwise you get ValueError: cannot handle a non-unique multi-index!)
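A lookup that avoids NaN keys altogether is to select by position with a boolean mask on the top level. This is only a sketch: the frame is rebuilt with fixed numbers so it runs on its own, and the names mask, c_block and c_values are illustrative.

```python
import numpy as np
import pandas as pd

# Same shape as the frame in the question, with a NaN sub-level under "C".
columns = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B", "C"], [1, 2, 1, 2, np.nan]]
)
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4, 3], [0.5, 0.6, 0.7, 0.8, 3]],
    columns=columns,
)

# True for every column whose top-level label is "C".
mask = df.columns.get_level_values(0) == "C"

# Boolean selection never looks up the NaN label itself.
c_block = df.loc[:, mask]

# Take the first matching column as a Series.
c_values = c_block.iloc[:, 0]
print(c_values.tolist())
```

Because the mask compares only the top-level labels, it should not depend on how many NaN columns sit under C; c_block simply contains all of them.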
CodePudding user response:
Not exactly what I was looking for, but I'm finally going with the solution posted here. Essentially, replacing the np.nan values with empty strings is easier to handle. At dataframe creation,
df[('C', '')] = 3
and then
In [7]: df[('C', '')]
Out[7]:
0 3
1 3
Name: (C, ), dtype: int64