Consider the following DataFrame:
import numpy as np
import pandas as pd
arrays1 = [
[
"A",
"A",
"A",
"B",
"B",
"B",
"C",
"C",
"C",
"D",
"D",
"D",
],
[
"qux",
"quux",
"corge",
"qux",
"quux",
"corge",
"qux",
"quux",
"corge",
"qux",
"quux",
"corge",
],
[
"one",
"two",
"three",
"one",
"two",
"three",
"one",
"two",
"three",
"one",
"two",
"three",
],
]
tuples1 = list(zip(*arrays1))
index_values1 = pd.MultiIndex.from_tuples(tuples1)
df1 = pd.DataFrame(
np.ones((12, 12)), index=index_values1, columns=index_values1
)
Yielding:
A B C D
qux quux corge qux quux corge qux quux corge qux quux corge
one two three one two three one two three one two three
A qux one 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
B qux one 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
C qux one 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
D qux one 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Say I want to set everything to zero, except for the following rows and columns:
A B C D
qux quux corge qux quux corge qux quux corge qux quux corge
one two three one two three one two three one two three
A qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
D qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
That is set everything to zero except for column A and B, and (quux, two), (corge, three) in index C and D.
To select column A and B and (quux, two), (corge, three)
in both index B and C, I expected to be able to do:
l_col_lvl0 = ['A', 'B']
l_idx_lvl0 = ['C', 'D']
l_idx_lvl1 = [("quux", "two"), ("corge", "three")]
df1_a_bc_qc = df1.loc[(l_idx_lvl0, l_idx_lvl1), l_col_lvl0]
However, this returns an empty DataFrame, and the following error message:
FutureWarning: The behavior of indexing on a MultiIndex with a nested sequence of
labels is deprecated and will change in a future version. `series.loc[label, sequence]`
will raise if any members of 'sequence' or not present in the index's second level.
To retain the old behavior, use `series.index.isin(sequence, level=1)`
df1_a_bc_qc = df1.loc[(l_idx_lvl0, l_idx_lvl1), l_col_lvl0]
In turn, I can select column A and B, and indices C and D.
df1_ab_cd = df1.loc[l_idx_lvl0, l_col_lvl0]
A B
qux quux corge qux quux corge
one two three one two three
C qux one 1.0 1.0 1.0 1.0 1.0 1.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0
D qux one 1.0 1.0 1.0 1.0 1.0 1.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0
Moreover, I can select (quux, two), (corge, three) in either index C or D:
df1_ab_c_qc = df1.loc['C',l_col_lvl0].loc[l_idx_lvl1]
A B
qux quux corge qux quux corge
one two three one two three
quux two 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0
df1_ab_d_qc = df1.loc['D',l_col_lvl0].loc[l_idx_lvl1]
A B
qux quux corge qux quux corge
one two three one two three
quux two 1.0 1.0 1.0 1.0 1.0 1.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0
However, if I understand correctly, chained assignments are discouraged.
Moreover, if I try to pass l_idx_lvl0
instead, I get the following error message:
df1_ab_cd_qc = df1.loc[l_idx_lvl0,l_col_lvl0].loc[l_idx_lvl1]
ValueError: operands could not be broadcast together with shapes (2,2) (3,) (2,2)
In conclusion, how can I set verything to 0, except for column A and B, and (quux, two), (corge, three) in index B and C?
I believe that the solution to Question 6 (Select rows in pandas MultiIndex DataFrame) is very close to what I'm looking for, though I haven't gotten it to work for this case.
I would like to be flexible in the assignment by passing lists, instead of individual labels. Moreover, labels in level 1 and 2 are preferably not separated. That is, (quux, two) and (corge, three) should be passed together, instead of per level. The reason I mention this, is that I see that in Q6, labels are passed per level (i.e. df.loc[(('a', 'b'), ('u', 'v')), :]
).
Any help is much appreciated.
CodePudding user response:
You can't select directly by exclusion. What you can do is building a 2D mask for indexing.
mask = pd.DataFrame(True, index=df1.index, columns=df1.columns)
idx = pd.MultiIndex.from_tuples([
('C', 'quux', 'two'),
('C', 'corge', 'three'),
('D', 'quux', 'two'),
('D', 'corge', 'three')])
mask.loc[idx, ['A', 'B']] = False
df1[mask] = 0
print(df1)
Output:
A B C D
qux quux corge qux quux corge qux quux corge qux quux corge
one two three one two three one two three one two three
A qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
D qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
CodePudding user response:
Thanks to @mozway's answer, I resolved this issue as follows:
l_col_lvl0 = ["A", "B"]
l_idx_lvl0 = ["C", "D"]
l_idx_lvl12 = [("quux", "two"), ("corge", "three")]
l_idx = []
for idx_lvl0 in l_idx_lvl0:
for t_idx_lvl1 in l_idx_lvl12:
idx_lvl1, idx_lvl2 = t_idx_lvl1
t_idx_lvl012 = (idx_lvl0, idx_lvl1, idx_lvl2)
l_idx.append(t_idx_lvl012)
idx = pd.MultiIndex.from_tuples(l_idx)
df1_sel = df1.loc[idx, l_col_lvl0]
df1_0 = df1.copy()*0
df1_0.loc[idx, l_col_lvl0] = df1_sel
A B C D
qux quux corge qux quux corge qux quux corge qux quux corge
one two three one two three one two three one two three
A qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
D qux one 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
quux two 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
corge three 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
In particular, the answer helped me realize that I should first verbosely create the MultiIndex with all levels, after which I can make the selection as intended.