Setting values in subselection of MultiIndex pandas-CodePudding

Consider the following DataFrame:

import numpy as np
import pandas as pd

arrays1 = [
    [
        "A",
        "A",
        "A",
        "B",
        "B",
        "B",
        "C",
        "C",
        "C",
        "D",
        "D",
        "D",
    ],
    [
        "qux",
        "quux",
        "corge",
        "qux",
        "quux",
        "corge",
        "qux",
        "quux",
        "corge",
        "qux",
        "quux",
        "corge",
    ],
    [
        "one",
        "two",
        "three",
        "one",
        "two",
        "three",
        "one",
        "two",
        "three",
        "one",
        "two",
        "three",
    ],
]
tuples1 = list(zip(*arrays1))
index_values1 = pd.MultiIndex.from_tuples(tuples1)
df1 = pd.DataFrame(
    np.ones((12, 12)), index=index_values1, columns=index_values1
)

Yielding:

                 A               B               C               D           
               qux quux corge  qux quux corge  qux quux corge  qux quux corge
               one  two three  one  two three  one  two three  one  two three
A qux   one    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
B qux   one    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
C qux   one    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
D qux   one    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0  1.0  1.0   1.0

Say I want to set everything to zero, except for the following rows and columns:

                 A               B               C               D           
               qux quux corge  qux quux corge  qux quux corge  qux quux corge
               one  two three  one  two three  one  two three  one  two three
A qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
B qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
C qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
D qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0

That is set everything to zero except for column A and B, and (quux, two), (corge, three) in index C and D.

To select column A and B and (quux, two), (corge, three) in both index B and C, I expected to be able to do:

l_col_lvl0 = ['A', 'B']
l_idx_lvl0 = ['C', 'D']
l_idx_lvl1 = [("quux", "two"), ("corge", "three")]
df1_a_bc_qc = df1.loc[(l_idx_lvl0, l_idx_lvl1), l_col_lvl0]

However, this returns an empty DataFrame, and the following error message:

FutureWarning: The behavior of indexing on a MultiIndex with a nested sequence of 
labels is deprecated and will change in a future version. `series.loc[label, sequence]` 
will raise if any members of 'sequence' or not present in the index's second level. 
To retain the old behavior, use `series.index.isin(sequence, level=1)`
  df1_a_bc_qc = df1.loc[(l_idx_lvl0, l_idx_lvl1), l_col_lvl0]

In turn, I can select column A and B, and indices C and D.

df1_ab_cd = df1.loc[l_idx_lvl0, l_col_lvl0]

                 A               B           
               qux quux corge  qux quux corge
               one  two three  one  two three
C qux   one    1.0  1.0   1.0  1.0  1.0   1.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0
D qux   one    1.0  1.0   1.0  1.0  1.0   1.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0

Moreover, I can select (quux, two), (corge, three) in either index C or D:

df1_ab_c_qc = df1.loc['C',l_col_lvl0].loc[l_idx_lvl1]

                   A               B           
             qux quux corge  qux quux corge
             one  two three  one  two three
quux  two    1.0  1.0   1.0  1.0  1.0   1.0
corge three  1.0  1.0   1.0  1.0  1.0   1.0

df1_ab_d_qc = df1.loc['D',l_col_lvl0].loc[l_idx_lvl1]

                   A               B           
             qux quux corge  qux quux corge
             one  two three  one  two three
quux  two    1.0  1.0   1.0  1.0  1.0   1.0
corge three  1.0  1.0   1.0  1.0  1.0   1.0

However, if I understand correctly, chained assignments are discouraged.

Moreover, if I try to pass l_idx_lvl0 instead, I get the following error message:

df1_ab_cd_qc = df1.loc[l_idx_lvl0,l_col_lvl0].loc[l_idx_lvl1]

ValueError: operands could not be broadcast together with shapes (2,2) (3,) (2,2)

In conclusion, how can I set verything to 0, except for column A and B, and (quux, two), (corge, three) in index B and C?

I believe that the solution to Question 6 (Select rows in pandas MultiIndex DataFrame) is very close to what I'm looking for, though I haven't gotten it to work for this case.

I would like to be flexible in the assignment by passing lists, instead of individual labels. Moreover, labels in level 1 and 2 are preferably not separated. That is, (quux, two) and (corge, three) should be passed together, instead of per level. The reason I mention this, is that I see that in Q6, labels are passed per level (i.e. df.loc[(('a', 'b'), ('u', 'v')), :]).

Any help is much appreciated.

CodePudding user response：

You can't select directly by exclusion. What you can do is building a 2D mask for indexing.

mask = pd.DataFrame(True, index=df1.index, columns=df1.columns)
idx = pd.MultiIndex.from_tuples([
            ('C',  'quux',   'two'),
            ('C', 'corge', 'three'),
            ('D',  'quux',   'two'),
            ('D', 'corge', 'three')])


mask.loc[idx, ['A', 'B']] = False
df1[mask] = 0

print(df1)

Output:


                 A               B               C               D           
               qux quux corge  qux quux corge  qux quux corge  qux quux corge
               one  two three  one  two three  one  two three  one  two three
A qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
B qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
C qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
D qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0

CodePudding user response：

Thanks to @mozway's answer, I resolved this issue as follows:

l_col_lvl0 = ["A", "B"]
l_idx_lvl0 = ["C", "D"]
l_idx_lvl12 = [("quux", "two"), ("corge", "three")]

l_idx = []
for idx_lvl0 in l_idx_lvl0:
    for t_idx_lvl1 in l_idx_lvl12:
        idx_lvl1, idx_lvl2 = t_idx_lvl1
        t_idx_lvl012 = (idx_lvl0, idx_lvl1, idx_lvl2)
        l_idx.append(t_idx_lvl012)

idx = pd.MultiIndex.from_tuples(l_idx)

df1_sel = df1.loc[idx, l_col_lvl0]
df1_0 = df1.copy()*0
df1_0.loc[idx, l_col_lvl0] = df1_sel



                 A               B               C               D           
               qux quux corge  qux quux corge  qux quux corge  qux quux corge
               one  two three  one  two three  one  two three  one  two three
A qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
B qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
C qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
D qux   one    0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0   0.0
  quux  two    1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0
  corge three  1.0  1.0   1.0  1.0  1.0   1.0  0.0  0.0   0.0  0.0  0.0   0.0

In particular, the answer helped me realize that I should first verbosely create the MultiIndex with all levels, after which I can make the selection as intended.