Dynamically remove leading and trailing rows after groupby if condition met

Time:08-12

I have this DataFrame:

import pandas as pd

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)

Printed data:

 ------ ------- ------ ------ ------ 
| year | month | val1 | val2 | val3 |
 ------ ------- ------ ------ ------ 
| 2019 | 1     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2019 | 2     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2019 | 3     | 0    | 1    | 0    |
 ------ ------- ------ ------ ------ 
| 2019 | 4     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2019 | 5     | 1    | 1    | 1    |
 ------ ------- ------ ------ ------ 
| 2019 | 6     | 0    | 0    | 1    |
 ------ ------- ------ ------ ------ 
| 2019 | 7     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2019 | 8     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2020 | 5     | 0    | 1    | 0    |
 ------ ------- ------ ------ ------ 
| 2020 | 6     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2020 | 7     | 1    | 1    | 1    |
 ------ ------- ------ ------ ------ 
| 2020 | 8     | 0    | 0    | 1    |
 ------ ------- ------ ------ ------ 
| 2020 | 9     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2020 | 10    | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 

Expected Output:

 ------ ------- ------ ------ ------ 
| year | month | val1 | val2 | val3 |
 ------ ------- ------ ------ ------ 
| 2019 | 3     | 0    | 1    | 0    |
 ------ ------- ------ ------ ------ 
| 2019 | 4     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2019 | 5     | 1    | 1    | 1    |
 ------ ------- ------ ------ ------ 
| 2019 | 6     | 0    | 0    | 1    |
 ------ ------- ------ ------ ------ 
| 2020 | 5     | 0    | 1    | 0    |
 ------ ------- ------ ------ ------ 
| 2020 | 6     | 0    | 0    | 0    |
 ------ ------- ------ ------ ------ 
| 2020 | 7     | 1    | 1    | 1    |
 ------ ------- ------ ------ ------ 
| 2020 | 8     | 0    | 0    | 1    |
 ------ ------- ------ ------ ------ 

I need to apply a groupby on the columns year and month. Within each year, I want to drop the leading and trailing rows in which val1, val2 and val3 are all zero. All-zero rows in the middle of the year should not be removed.

Is it possible to check the start of each group and drop rows one by one until a valX column has a non-zero value, and then do the same in reverse from the end of the group?
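Read literally, the idea above amounts to finding the first and the last row with a non-zero value in each group and slicing between them. A minimal sketch, assuming grouping by year only and using a hypothetical helper name trim_group:

```python
import numpy as np
import pandas as pd

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)
value_cols = ['val1', 'val2', 'val3']

def trim_group(g, value_cols):
    # positions (within the group) of rows with at least one non-zero value
    nz = np.flatnonzero(g[value_cols].ne(0).any(axis=1).to_numpy())
    if nz.size == 0:
        # the group is all zeros: nothing to keep
        return g.iloc[0:0]
    # keep everything from the first to the last non-zero row, inclusive
    return g.iloc[nz[0]:nz[-1] + 1]

# sort=False keeps the groups in their order of appearance
trimmed = pd.concat(trim_group(g, value_cols) for _, g in df.groupby('year', sort=False))
print(trimmed)
```

This loops over the groups in Python, so it is likely slower on many groups than a vectorized mask, but it mirrors the row-by-row description directly.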

CodePudding user response:

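Create a mask that tests, per group, whether the cumulative sum of any value column is non-zero both from the top and from the bottom of the group:

df1 = df.drop(columns='month')
# forward: some value column has been non-zero at or before this row
fwd = df1.groupby('year').cumsum().ne(0).any(axis=1)
# reverse: some value column is non-zero at or after this row
bwd = df1.iloc[::-1].groupby('year').cumsum().ne(0).any(axis=1)
df = df[fwd & bwd]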
print(df)
    year  month  val1  val2  val3
2   2019      3     0     1     0
3   2019      4     0     0     0
4   2019      5     1     1     1
5   2019      6     0     0     1
8   2020      5     0     1     0
9   2020      6     0     0     0
10  2020      7     1     1     1
11  2020      8     0     0     1
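One caveat worth checking: a cumulative sum only signals "a non-zero value has been seen" when the values cannot cancel each other out. The sketch below uses a small hypothetical frame (demo, not part of the question) with a negative value to compare a cumsum-based flag against a flag built from ne(0) followed by cummax:

```python
import pandas as pd

# hypothetical data: the -1 cancels the earlier 1 in the running sum
demo = pd.DataFrame({'year': [2021, 2021, 2021, 2021],
                     'val1': [1, -1, 0, 2]})

# running sum is 1, 0, 0, 2, so rows 1 and 2 look as if nothing was seen yet
by_cumsum = demo.groupby('year')['val1'].cumsum().ne(0)

# flag non-zero values first, then take the running maximum of the flags
by_cummax = demo['val1'].ne(0).groupby(demo['year']).cummax().astype(bool)

print(by_cumsum.tolist())
print(by_cummax.tolist())
```

With non-negative data such as the question's 0/1 columns, both flags should agree.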

CodePudding user response:

IIUC, build a boolean Series that flags the rows with any non-zero value, run groupby.cummax on it both forward and in reverse, then keep the rows where both results are True:

# is any val non 0 in the row?
s = df[['val1', 'val2', 'val3']].ne(0).any(axis=1)
# or to select all "val" columns:
# s = df.filter(like='val').ne(0).any(axis=1)

# forward check for intermediates
m1 = s.groupby(df['year']).cummax()
# reverse check for intermediates
m2 = s[::-1].groupby(df['year']).cummax()

# keep if both forward and reverse
out = df[m1&m2]

Output:

    year  month  val1  val2  val3
2   2019      3     0     1     0
3   2019      4     0     0     0
4   2019      5     1     1     1
5   2019      6     0     0     1
8   2020      5     0     1     0
9   2020      6     0     0     0
10  2020      7     1     1     1
11  2020      8     0     0     1

Intermediates:

    year  month  val1  val2  val3      s     m1     m2  m1&m2
0   2019      1     0     0     0  False  False   True  False
1   2019      2     0     0     0  False  False   True  False
2   2019      3     0     1     0   True   True   True   True
3   2019      4     0     0     0  False   True   True   True
4   2019      5     1     1     1   True   True   True   True
5   2019      6     0     0     1   True   True   True   True
6   2019      7     0     0     0  False   True  False  False
7   2019      8     0     0     0  False   True  False  False
8   2020      5     0     1     0   True   True   True   True
9   2020      6     0     0     0  False   True   True   True
10  2020      7     1     1     1   True   True   True   True
11  2020      8     0     0     1   True   True   True   True
12  2020      9     0     0     0  False   True  False  False
13  2020     10     0     0     0  False   True  False  False
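The steps above can be wrapped into a small reusable function. This is only a sketch: the name trim_zero_edges and its parameters are made up for illustration, and it assumes a single grouping column.

```python
import pandas as pd

def trim_zero_edges(df, group_col, value_cols):
    """Drop leading and trailing rows, per group, where all value_cols are zero."""
    # True for rows with at least one non-zero value
    s = df[value_cols].ne(0).any(axis=1)
    keys = df[group_col]
    # forward pass: True from the first non-zero row of each group onwards
    m1 = s.groupby(keys).cummax().astype(bool)
    # reverse pass: True up to the last non-zero row; flipped back to the original order
    m2 = s[::-1].groupby(keys[::-1]).cummax()[::-1].astype(bool)
    return df[m1 & m2]

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)

out = trim_zero_edges(df, 'year', ['val1', 'val2', 'val3'])
print(out)
```

The astype(bool) calls are a defensive choice so that the masks are plain booleans before they are combined and used for indexing.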

For fun, suppose you wanted the opposite: keep the leading and trailing zero rows (and every non-zero row) and drop only the zero rows in the middle: df[s | ~(m1 & m2)]
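A self-contained sketch of that last expression on the question's data, with the masks rebuilt as in this answer. The variable name without_middle_zeros is made up here.

```python
import pandas as pd

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)

# rows with at least one non-zero value
s = df.filter(like='val').ne(0).any(axis=1)
# forward and reverse running maximum per year, both in the original row order
m1 = s.groupby(df['year']).cummax().astype(bool)
m2 = s[::-1].groupby(df['year'][::-1]).cummax()[::-1].astype(bool)

# keep non-zero rows plus everything outside the first..last non-zero span
without_middle_zeros = df[s | ~(m1 & m2)]
print(without_middle_zeros)
```

On this data only rows 3 and 9 (month 4 of 2019 and month 6 of 2020) are dropped, since they are the only all-zero rows between non-zero rows.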
