I have this DataFrame:
import pandas as pd

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)
Printed data:
year  month  val1  val2  val3
2019      1     0     0     0
2019      2     0     0     0
2019      3     0     1     0
2019      4     0     0     0
2019      5     1     1     1
2019      6     0     0     1
2019      7     0     0     0
2019      8     0     0     0
2020      5     0     1     0
2020      6     0     0     0
2020      7     1     1     1
2020      8     0     0     1
2020      9     0     0     0
2020     10     0     0     0
Expected Output:
year  month  val1  val2  val3
2019      3     0     1     0
2019      4     0     0     0
2019      5     1     1     1
2019      6     0     0     1
2020      5     0     1     0
2020      6     0     0     0
2020      7     1     1     1
2020      8     0     0     1
I need to apply a groupby on the columns year and month, but within each year I am only interested in the rows between the first and the last non-zero value in val1, val2 or val3. Rows that are all zero at the beginning and at the end of a year should be dropped; zero rows in the middle of the year should not be removed.
Is it possible to check the start of each group and drop row by row until some valX column has a non-zero value, and then apply the same approach in reverse from the end of the group?
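The row-by-row procedure described above can be sketched literally, one group at a time. This is only an illustration: `trim_group` is a hypothetical helper name, and dropping a year that contains no non-zero value at all is an assumption, since the question does not cover that case.

```python
import numpy as np
import pandas as pd

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)
value_cols = ['val1', 'val2', 'val3']

def trim_group(group):
    # Positions (within the group) of rows with at least one non-zero value
    hits = np.flatnonzero(group[value_cols].ne(0).any(axis=1).to_numpy())
    if hits.size == 0:
        # Assumption: a group without any non-zero value is dropped entirely
        return group.iloc[0:0]
    # Keep everything from the first to the last non-zero row, inclusive
    return group.iloc[hits[0]:hits[-1] + 1]

# Trim each year separately, then stitch the pieces back together
trimmed = pd.concat([trim_group(g) for _, g in df.groupby('year')])
print(trimmed)
```

An explicit loop like this is easy to follow, but a vectorized mask is usually preferable on large frames.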
CodePudding user response:
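Create a mask that tests, per year, whether the cumulative sum is non-zero in any column, once scanning forward and once scanning backward. A row is kept only if both scans are True:
df1 = df.drop(columns='month')
# True once any column has been non-zero, scanning from the top of each year
fwd = df1.groupby('year').cumsum().ne(0).any(axis=1)
# same test scanning from the bottom of each year
bwd = df1.iloc[::-1].groupby('year').cumsum().ne(0).any(axis=1)
df = df[fwd & bwd]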
print(df)
    year  month  val1  val2  val3
2   2019      3     0     1     0
3   2019      4     0     0     0
4   2019      5     1     1     1
5   2019      6     0     0     1
8   2020      5     0     1     0
9   2020      6     0     0     0
10  2020      7     1     1     1
11  2020      8     0     0     1
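A minimal self-contained sketch of the cumulative-sum idea, wrapped in a hypothetical helper named `trim_zero_edges`. It assumes non-negative values (so sums cannot cancel out to zero) and reduces each scan to one boolean per row before combining them, so a row counts as "inside" even when the first and the last non-zero values sit in different columns. The small `demo` frame is made up to show exactly that case.

```python
import pandas as pd

def trim_zero_edges(df, group_col, value_cols):
    """Drop leading/trailing all-zero rows per group (hypothetical helper).

    Assumes non-negative values, so a cumulative sum of zero means
    "no non-zero value seen so far".
    """
    rev = df.iloc[::-1]
    # True once any value column has been non-zero, scanning top-down
    seen_fwd = df[value_cols].groupby(df[group_col]).cumsum().ne(0).any(axis=1)
    # Same scan bottom-up, on the reversed frame
    seen_bwd = rev[value_cols].groupby(rev[group_col]).cumsum().ne(0).any(axis=1)
    # Reverse the backward mask again so both masks share the same row order
    return df[seen_fwd & seen_bwd.iloc[::-1]]

# Made-up data: first non-zero value is in val1, last one is in val2
demo = pd.DataFrame({'year': [2019] * 5,
                     'month': [1, 2, 3, 4, 5],
                     'val1': [0, 1, 0, 0, 0],
                     'val2': [0, 0, 0, 1, 0]})
trimmed_demo = trim_zero_edges(demo, 'year', ['val1', 'val2'])
print(trimmed_demo)
```

In the demo, months 2 to 4 are kept, including the all-zero month 3 in the middle.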
CodePudding user response:
IIUC, build a boolean Series that is True for every row with at least one non-zero value, run groupby.cummax on it in forward and in reverse order, then keep the rows where both results are True:
# is any val non-zero in the row?
s = df[['val1', 'val2', 'val3']].ne(0).any(axis=1)
# or, to select all "val" columns:
# s = df.filter(like='val').ne(0).any(axis=1)
# forward pass: True from the first non-zero row of each year onward
m1 = s.groupby(df['year']).cummax()
# reverse pass: True from the last non-zero row of each year backward
m2 = s[::-1].groupby(df['year']).cummax()
# keep rows where both passes are True
out = df[m1 & m2]
output:
    year  month  val1  val2  val3
2   2019      3     0     1     0
3   2019      4     0     0     0
4   2019      5     1     1     1
5   2019      6     0     0     1
8   2020      5     0     1     0
9   2020      6     0     0     0
10  2020      7     1     1     1
11  2020      8     0     0     1
Intermediates:
    year  month  val1  val2  val3      s     m1     m2  m1&m2
0   2019      1     0     0     0  False  False   True  False
1   2019      2     0     0     0  False  False   True  False
2   2019      3     0     1     0   True   True   True   True
3   2019      4     0     0     0  False   True   True   True
4   2019      5     1     1     1   True   True   True   True
5   2019      6     0     0     1   True   True   True   True
6   2019      7     0     0     0  False   True  False  False
7   2019      8     0     0     0  False   True  False  False
8   2020      5     0     1     0   True   True   True   True
9   2020      6     0     0     0  False   True   True   True
10  2020      7     1     1     1   True   True   True   True
11  2020      8     0     0     1   True   True   True   True
12  2020      9     0     0     0  False   True  False  False
13  2020     10     0     0     0  False   True  False  False
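The table of intermediates can be rebuilt by attaching the masks as extra columns. A small sketch, assuming the sample data from the question; the names `steps` and `keep` are made up for this illustration, and `sort_index()` is only there to put the reversed mask back into the original row order.

```python
import pandas as pd

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)

# True for rows with at least one non-zero "val" column
s = df.filter(like='val').ne(0).any(axis=1)
# Forward pass: stays True after the first True within each year
m1 = s.groupby(df['year']).cummax()
# Reverse pass: same idea from the bottom, then restore the row order
m2 = s[::-1].groupby(df['year']).cummax().sort_index()

# Attach every intermediate mask as a column for inspection
steps = df.assign(s=s, m1=m1, m2=m2, keep=m1 & m2)
print(steps)
```

Printing `steps` makes it easy to see which pass excludes a given row.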
For fun, imagine you wanted the opposite treatment of the zero rows, keeping the leading and trailing all-zero rows and dropping only the all-zero rows in the middle: df[s|~(m1&m2)]
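To see which rows that expression keeps, here is a short sketch on the sample data from the question. The variable name `kept` is made up for the illustration. Only the two all-zero rows that sit between non-zero rows (index 3 and 9) should disappear.

```python
import pandas as pd

data = {'year': [2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020],
        'month': [1,2,3,4,5,6,7,8,5,6,7,8,9,10],
        'val1': [0,0,0,0,1,0,0,0,0,0,1,0,0,0],
        'val2': [0,0,1,0,1,0,0,0,1,0,1,0,0,0],
        'val3': [0,0,0,0,1,1,0,0,0,0,1,1,0,0]}
df = pd.DataFrame(data)

s = df.filter(like='val').ne(0).any(axis=1)
m1 = s.groupby(df['year']).cummax()
m2 = s[::-1].groupby(df['year']).cummax().sort_index()

# m1 & m2 marks the span from first to last non-zero row of each year.
# Keep a row if it is non-zero itself, or if it lies outside that span.
kept = df[s | ~(m1 & m2)]
print(kept.index.tolist())
```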