I am having some issues with what I thought a simple filtering task. We have a data that have roughly this shape :
name | item | cumsum |
---|---|---|
name1 | item 1 | 0.05 |
item 2 | 0.10 | |
item 3 | 0.31 | |
name2 | item 1 | 0.02 |
item 2 | 0.07 | |
name3 | item 1 | 0.01 |
item 2 | 0.07 | |
item 3 | 0.21 | |
name4 | item 1 | 0.03 |
item 2 | 0.12 | |
item 3 | 0.21 | |
item 4 | 0.35 |
What I would like to is to return the dataframe with items smaller than 0.2 and the item directly above. This is table I would like as an output:
name | item | cumsum |
---|---|---|
name1 | item 1 | 0.05 |
item 2 | 0.10 | |
item 3 | 0.31 | |
name2 | item 1 | 0.02 |
item 2 | 0.07 | |
name3 | item 1 | 0.01 |
item 2 | 0.07 | |
item 3 | 0.21 | |
name4 | item 1 | 0.03 |
item 2 | 0.12 | |
item 3 | 0.21 |
I tried for each 'name' to find the 'item' that have a cumsum greater than 0.2 and then return the whole range with the indexes as :
df = df.loc['name1']
idx = df.loc[df['cumsum'] > 0.2].index[0]
iidx = df.index.get_loc(idx) 1
df = df.iloc[:iidx]
and do this for each 'name'. However this fails for name2.
Can anybody help with this please ?
CodePudding user response:
Use |
for bitwise OR
by mask shifted per groups by DataFrameGroupBy.shift
:
m = (df['cumsum'] < 0.2)
df = df[m | m.groupby(level=0).shift(fill_value=False)]
print (df)
cumsum
name item
name1 item 1 0.05
item 2 0.10
item 3 0.31
name2 item 1 0.02
item 2 0.07
name3 item 1 0.01
item 2 0.07
item 3 0.21
name4 item 1 0.03
item 2 0.12
item 3 0.21