I have a pandas
DataFrame with a MultiIndex on the columns and want to access all of the columns of bar
plus one additional column under baz
- how can I achieve that?
Input:
import numpy as np
import pandas as pd
np.random.seed(123)
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
idx = pd.MultiIndex.from_product(iterables, names=["first", "second"])
df = pd.DataFrame(np.random.randn(4, 8), columns=idx)
first bar baz foo qux
second one two one two one two one two
0 -1.085631 0.997345 0.282978 -1.506295 -0.578600 1.651437 -2.426679 -0.428913
1 1.265936 -0.866740 -0.678886 -0.094709 1.491390 -0.638902 -0.443982 -0.434351
2 2.205930 2.186786 1.004054 0.386186 0.737369 1.490732 -0.935834 1.175829
3 -1.253881 -0.637752 0.907105 -1.428681 -0.140069 -0.861755 -0.255619 -2.798589
Expected output:
first bar baz
second one two two
0 -1.085631 0.997345 -1.506295
1 1.265936 -0.866740 -0.094709
2 2.205930 2.186786 0.386186
3 -1.253881 -0.637752 -1.428681
I can specify specific columns like df.loc[:, [("bar", "one"), ("bar", "two"), ("baz", "two")]]
but I can't seem to specify an arbitrary number for only one top-level, like df.loc[:, ["bar", ("baz", "two")]]
CodePudding user response:
iloc
solution
ix = df.columns.get_loc
df.iloc[:, np.r_[ix('bar'), ix(('baz', 'two'))]]
first bar baz
second one two two
0 -1.085631 0.997345 -1.506295
1 1.265936 -0.866740 -0.094709
2 2.205930 2.186786 0.386186
3 -1.253881 -0.637752 -1.428681
CodePudding user response:
If you don't want to key in everything, you can split it up and then use pd.concat
.
bar = df.loc[:, ["bar"]]
baz = df.loc[:, [('baz','two')]]
pd.concat([bar,baz], axis=1)
Out[451]:
first bar baz
second one two two
0 -1.085631 0.997345 -1.506295
1 1.265936 -0.866740 -0.094709
2 2.205930 2.186786 0.386186
3 -1.253881 -0.637752 -1.428681
CodePudding user response:
An alternative route would be to select all of Baz
and bar
and subsequently drop the unwanted column :
df.loc(axis = 1)[['bar','baz']].drop(columns=('baz','one'))
Out[24]:
first bar baz
second one two two
0 -1.085631 0.997345 -1.506295
1 1.265936 -0.866740 -0.094709
2 2.205930 2.186786 0.386186
3 -1.253881 -0.637752 -1.428681
You could also preselect the relevant columns with a list comprehension:
columns = [(left, right)
for left, right in df.columns
if left == 'bar'
or (left, right) == ('baz', 'one')]
df.loc[:, columns]
first bar baz
second one two one
0 -1.085631 0.997345 0.282978
1 1.265936 -0.866740 -0.678886
2 2.205930 2.186786 1.004054
3 -1.253881 -0.637752 0.907105
with pyjanitor's select_columns, one can pass a callable, in this case it mimics the list comprehension above:
df.select_columns(lambda df: df.name[0] == 'bar' or df.name == ('baz', 'one'))
Out[31]:
first bar baz
second one two one
0 -1.085631 0.997345 0.282978
1 1.265936 -0.866740 -0.678886
2 2.205930 2.186786 1.004054
3 -1.253881 -0.637752 0.907105