I am trying to get all null rows above a value using group by.
So, for example given following dataframe.
---- ------------ ----------- -------- ---------- -------- ----------
| ID | Start Date | End Date | Date_D | D-Values | Date_R | R-Values |
---- ------------ ----------- -------- ---------- -------- ----------
| A | 2/26/2015 | 5/26/2015 | JAN_15 | - | 15-Jan | - |
| A | 2/26/2015 | 5/26/2015 | FEB_15 | - | 15-Feb | - |
| A | 2/26/2015 | 5/26/2015 | MAR_15 | - | 15-Mar | - |
| A | 2/26/2015 | 5/26/2015 | APR_15 | - | 15-Apr | - |
| A | 2/26/2015 | 5/26/2015 | MAY_15 | -28 | 15-May | 15000 |
| A | 2/26/2015 | 5/26/2015 | JUN_15 | - | 15-Jun | - |
| A | 2/26/2015 | 5/26/2015 | JUL_15 | - | 15-Jul | - |
| A | 2/26/2015 | 5/26/2015 | AUG_15 | - | 15-Aug | - |
---- ------------ ----------- -------- ---------- -------- ----------
My desired output looks like this.
---- ------------ ----------- -------- ---------- -------- ----------
| ID | Start Date | End Date | Date_D | D-Values | Date_R | R-Values |
---- ------------ ----------- -------- ---------- -------- ----------
| A | 2/26/2015 | 5/26/2015 | FEB_15 | - | 15-Feb | - |
| A | 2/26/2015 | 5/26/2015 | MAR_15 | - | 15-Mar | - |
| A | 2/26/2015 | 5/26/2015 | APR_15 | - | 15-Apr | - |
| A | 2/26/2015 | 5/26/2015 | MAY_15 | -28 | 15-May | 15000 |
---- ------------ ----------- -------- ---------- -------- ----------
EDIT
There are multiple IDs so need to implement on multiple customers. Want rows based on start date and end date e.g. start selecting rows from Feb_15 and upto the last non-null value within the date range.
CodePudding user response:
Use DataFrame.isna()
and Series.cumprod()
to check first not null:
df[df[['D-Values','R-Values']]
#.replace('-', np.nan) # if necessary
.isna()
.any(axis=1)
.groupby(df['ID'])
.cumprod()
.groupby(df['ID'])
.shift(fill_value=True)
.astype(bool)
& df['Date_D'].eq('FEB_15')
#.groupby(df['ID']) # BY ID
.cummax()
.eq(1)
]
CodePudding user response:
You can do transform
with idxmax
idx = df[['D-Values','R-Values']].notna().all(1).groupby(df["ID"]).transform('idxmax')
out = df[df.index <= idx]
CodePudding user response:
IIUC, you want to remove the last rows with "-", and assuming 'D-values" the reference column.
You can compute a cummax
on the reversed boolean array:
mask = df['D-Values'].ne('-').iloc[::-1].cummax()
# or, for NaNs:
# mask = df['D-Values'].notna().iloc[::-1].cummax()
df2 = df[mask]
output:
ID Start Date End Date Date_D D-Values Date_R R-Values
0 A 1/26/2015 5/26/2015 JAN_15 - 15-Jan -
1 A 1/26/2015 5/26/2015 FEB_15 - 15-Feb -
2 A 1/26/2015 5/26/2015 MAR_15 - 15-Mar -
3 A 1/26/2015 5/26/2015 APR_15 - 15-Apr -
4 A 1/26/2015 5/26/2015 MAY_15 -28 15-May 15000