Drop rows from a slice of Multi-Index DataFrame based on boolean

EDIT: Upon request I provide an example that is closer to the real data I am working with.

So I have a table data that looks something like

            value0    value1    value2
run step                              
0   0      0.12573 -0.132105  0.640423
    1       0.1049 -0.535669  0.361595
    2        1.304  0.947081 -0.703735
    3    -1.265421 -0.623274  0.041326
    4    -2.325031 -0.218792 -1.245911
    5    -0.732267 -0.544259   -0.3163
1   0     0.411631  1.042513 -0.128535
    1     1.366463 -0.665195   0.35151
    2      0.90347  0.094012 -0.743499
    3    -0.921725 -0.457726  0.220195
    4    -1.009618 -0.209176 -0.159225
    5     0.540846  0.214659  0.355373

(think: collection of time series) and a second table valid_range

     start  stop
run             
0        1     3
1        2     5

For each run I want to drop all rows that do not satisfy start≤step≤stop.

I tried the following (table generating code at the end)

for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]

However, this results in:

         value0 value1 value2
run step                     
0   0       NaN    NaN    NaN
    1       NaN    NaN    NaN
    2       NaN    NaN    NaN
    3       NaN    NaN    NaN
    4       NaN    NaN    NaN
    5       NaN    NaN    NaN
1   0       NaN    NaN    NaN
    1       NaN    NaN    NaN
    2       NaN    NaN    NaN
    3       NaN    NaN    NaN
    4       NaN    NaN    NaN
    5       NaN    NaN    NaN

I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...


Generating code for the tables

import numpy as np
from pandas import DataFrame, MultiIndex, Index
rng = np.random.default_rng(0)

valid_range = DataFrame({"start": [1, 2], "stop":[3, 5]}, index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[],[]], codes=[[],[]], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)

# fill data with one random row per (run, step) pair
for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=3)

CodePudding user response:

First, merge data and valid_range on 'run' using the merge method:

>>> data

            value0     value1    value2
run step                               
0   0      0.12573  -0.132105  0.640423
    1       0.1049  -0.535669  0.361595
    2        1.304   0.947081 -0.703735
    3     -1.26542  -0.623274  0.041326
    4     -2.32503  -0.218792  -1.24591
    5    -0.732267  -0.544259   -0.3163
1   0     0.411631    1.04251 -0.128535
    1      1.36646  -0.665195   0.35151
    2      0.90347  0.0940123 -0.743499
    3    -0.921725  -0.457726  0.220195
    4     -1.00962  -0.209176 -0.159225
    5     0.540846   0.214659  0.355373


>>> valid_range

     start  stop
run             
0        1     3
1        2     5

>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged 

    run  step    value0     value1    value2  start  stop
0     0     0   0.12573  -0.132105  0.640423      1     3
1     0     1    0.1049  -0.535669  0.361595      1     3
2     0     2     1.304   0.947081 -0.703735      1     3
3     0     3  -1.26542  -0.623274  0.041326      1     3
4     0     4  -2.32503  -0.218792  -1.24591      1     3
5     0     5 -0.732267  -0.544259   -0.3163      1     3
6     1     0  0.411631    1.04251 -0.128535      2     5
7     1     1   1.36646  -0.665195   0.35151      2     5
8     1     2   0.90347  0.0940123 -0.743499      2     5
9     1     3 -0.921725  -0.457726  0.220195      2     5
10    1     4  -1.00962  -0.209176 -0.159225      2     5
11    1     5  0.540846   0.214659  0.355373      2     5

Then select the rows that satisfy start <= step <= stop using eval, and use the resulting boolean array to mask data:

>>> cond = merged.eval('start <= step <= stop').to_numpy()
>>> data[cond]

            value0    value1    value2
run step                              
0   1       0.1049 -0.535669  0.361595
    2        1.304  0.947081 -0.703735
    3    -1.265421 -0.623274  0.041326
1   2      0.90347  0.094012 -0.743499
    3    -0.921725 -0.457726  0.220195
    4    -1.009618 -0.209176 -0.159225
    5     0.540846  0.214659  0.355373
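
If you then want data itself reduced to these rows (my addition, not part of the original answer), simply reassign the masked frame:

>>> data = data[cond]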

Or, if you prefer, here is a similar approach using query:

res = (
    data.reset_index()
        .merge(valid_range, on='run', how='left')
        .query('start <= step <= stop')
        .drop(columns=['start','stop'])
        .set_index(['run', 'step'])
)
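
As a quick sanity check (a hypothetical snippet of my own, not part of the answer), you can list which steps survive for each run; with the tables above this should give steps 1-3 for run 0 and steps 2-5 for run 1:

>>> res.reset_index().groupby('run')['step'].apply(list)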

CodePudding user response:

I would use groupby like this (shown on a generic df with an (animal, attribute) MultiIndex rather than the question's tables; an adaptation to data and valid_range is sketched after the output):

(df.groupby(level=0)
   .apply(lambda x: x[x['small']>1])
   .reset_index(level=0, drop=True)    # remove duplicate index
)

which gives:

                    big  small
animal attribute              
cow    speed       30.0   20.0
       weight     250.0  150.0
falcon speed      320.0  250.0
lama   speed       45.0   30.0
       weight     200.0  100.0
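
The same groupby/apply pattern can be adapted to the question's data and valid_range. A minimal sketch of my own (assuming the tables generated in the question):

def keep_valid(group):
    # group.name is the run label; look up its valid step range
    start, stop = valid_range.loc[group.name, ['start', 'stop']]
    steps = group.index.get_level_values('step')
    return group[(start <= steps) & (steps <= stop)]

result = data.groupby(level='run', group_keys=False).apply(keep_valid)

Here group_keys=False keeps the original (run, step) index, so no reset_index step is needed.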