pandas groupby and get all null rows till the first non null value in multiple columns


I am trying to use groupby to get all the null rows up to (and including) the first non-null value.

So, for example given following dataframe.

+----+------------+-----------+--------+----------+--------+----------+
| ID | Start Date | End Date  | Date_D | D-Values | Date_R | R-Values |
+----+------------+-----------+--------+----------+--------+----------+
| A  | 2/26/2015  | 5/26/2015 | JAN_15 |        - | 15-Jan |        - |
| A  | 2/26/2015  | 5/26/2015 | FEB_15 |        - | 15-Feb |        - |
| A  | 2/26/2015  | 5/26/2015 | MAR_15 |        - | 15-Mar |        - |
| A  | 2/26/2015  | 5/26/2015 | APR_15 |        - | 15-Apr |        - |
| A  | 2/26/2015  | 5/26/2015 | MAY_15 |      -28 | 15-May |    15000 |
| A  | 2/26/2015  | 5/26/2015 | JUN_15 |        - | 15-Jun |        - |
| A  | 2/26/2015  | 5/26/2015 | JUL_15 |        - | 15-Jul |        - |
| A  | 2/26/2015  | 5/26/2015 | AUG_15 |        - | 15-Aug |        - |
+----+------------+-----------+--------+----------+--------+----------+

My desired output looks like this.

+----+------------+-----------+--------+----------+--------+----------+
| ID | Start Date | End Date  | Date_D | D-Values | Date_R | R-Values |
+----+------------+-----------+--------+----------+--------+----------+
| A  | 2/26/2015  | 5/26/2015 | FEB_15 |        - | 15-Feb |        - |
| A  | 2/26/2015  | 5/26/2015 | MAR_15 |        - | 15-Mar |        - |
| A  | 2/26/2015  | 5/26/2015 | APR_15 |        - | 15-Apr |        - |
| A  | 2/26/2015  | 5/26/2015 | MAY_15 |      -28 | 15-May |    15000 |
+----+------------+-----------+--------+----------+--------+----------+

EDIT

There are multiple IDs, so this needs to work per customer. I want the rows selected based on Start Date and End Date, e.g. start selecting rows from FEB_15 and go up to the last non-null value within the date range.
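For reference, here is a reproducible version of the sample data (a sketch: the "-" placeholders are stored as NaN here so isna()/notna() work directly):

```python
import numpy as np
import pandas as pd

# Sample frame from the question; "-" placeholders represented as NaN.
df = pd.DataFrame({
    "ID": ["A"] * 8,
    "Start Date": ["2/26/2015"] * 8,
    "End Date": ["5/26/2015"] * 8,
    "Date_D": ["JAN_15", "FEB_15", "MAR_15", "APR_15",
               "MAY_15", "JUN_15", "JUL_15", "AUG_15"],
    "D-Values": [np.nan] * 4 + [-28] + [np.nan] * 3,
    "Date_R": ["15-Jan", "15-Feb", "15-Mar", "15-Apr",
               "15-May", "15-Jun", "15-Jul", "15-Aug"],
    "R-Values": [np.nan] * 4 + [15000] + [np.nan] * 3,
})
```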

CodePudding user response:

Use DataFrame.isna() and Series.cumprod() to flag rows up to the first non-null value, then combine with a mask that starts at FEB_15:

df[df[['D-Values', 'R-Values']]
     # .replace('-', np.nan)  # if the placeholders are "-" strings
     .isna()
     .any(axis=1)
     .groupby(df['ID'])
     .cumprod()
     .groupby(df['ID'])
     .shift(fill_value=True)
     .astype(bool)
   & df['Date_D'].eq('FEB_15')
                 # .groupby(df['ID'])  # per ID, if needed
                 .cummax()
                 .eq(1)
   ]
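Put together as a runnable sketch (only the relevant columns are reproduced, with the "-" placeholders assumed already converted to NaN, and an explicit int cast added so the groupby cumprod is unambiguous):

```python
import numpy as np
import pandas as pd

# Minimal frame: both value columns are NaN except the MAY_15 row.
df = pd.DataFrame({
    "ID": ["A"] * 8,
    "Date_D": ["JAN_15", "FEB_15", "MAR_15", "APR_15",
               "MAY_15", "JUN_15", "JUL_15", "AUG_15"],
    "D-Values": [np.nan] * 4 + [-28] + [np.nan] * 3,
    "R-Values": [np.nan] * 4 + [15000] + [np.nan] * 3,
})

keep_until = (df[["D-Values", "R-Values"]]
              .isna()
              .any(axis=1)
              .astype(int)          # 1 while the row is still (partly) null
              .groupby(df["ID"])
              .cumprod()            # stays 1 until the first fully populated row
              .groupby(df["ID"])
              .shift(fill_value=1)  # keep the first populated row itself
              .astype(bool))

# True from FEB_15 onward (would be grouped by ID for multiple customers)
start_mask = df["Date_D"].eq("FEB_15").cummax()

out = df[keep_until & start_mask]
print(out["Date_D"].tolist())  # FEB_15 .. MAY_15
```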

CodePudding user response:

You can use transform with idxmax to find, per ID, the first row where both columns are non-null:

idx = df[['D-Values', 'R-Values']].notna().all(axis=1).groupby(df["ID"]).transform('idxmax')
out = df[df.index <= idx]
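A self-contained sketch of this approach, with a second, hypothetical ID ("B") added to show the per-group behaviour (NaN placeholders assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical two-ID frame; each ID has one fully populated row.
df = pd.DataFrame({
    "ID": ["A"] * 4 + ["B"] * 3,
    "D-Values": [np.nan, np.nan, -28, np.nan, np.nan, 5, np.nan],
    "R-Values": [np.nan, np.nan, 15000, np.nan, np.nan, 7, np.nan],
})

# Index label of the first row where both columns are non-null, per ID
idx = (df[["D-Values", "R-Values"]]
       .notna()
       .all(axis=1)
       .groupby(df["ID"])
       .transform("idxmax"))

# Keep each group's rows up to and including that first non-null row
out = df[df.index <= idx]
```

Note that this relies on a default integer index; transform broadcasts each group's idxmax back to every row of the group, so the comparison works independently per ID.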

CodePudding user response:

IIUC, you want to remove the trailing rows with "-", assuming 'D-Values' is the reference column.

You can compute a cummax on the reversed boolean array:

mask = df['D-Values'].ne('-').iloc[::-1].cummax()
# or, for NaNs:
# mask = df['D-Values'].notna().iloc[::-1].cummax()

df2 = df[mask]

output:

  ID Start Date   End Date  Date_D D-Values  Date_R R-Values
0  A  2/26/2015  5/26/2015  JAN_15        -  15-Jan        -
1  A  2/26/2015  5/26/2015  FEB_15        -  15-Feb        -
2  A  2/26/2015  5/26/2015  MAR_15        -  15-Mar        -
3  A  2/26/2015  5/26/2015  APR_15        -  15-Apr        -
4  A  2/26/2015  5/26/2015  MAY_15      -28  15-May    15000
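For the multiple-ID case from the EDIT, the same reversed-cummax idea can be grouped per ID. A sketch with hypothetical data for two IDs, assuming NaN placeholders rather than "-" strings:

```python
import numpy as np
import pandas as pd

# Hypothetical two-ID frame; trailing all-null rows should go, per ID.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "D-Values": [np.nan, -28, np.nan, 5, np.nan, np.nan],
})

# Reverse each group, take a running "seen a value yet?" flag, restore order.
mask = (df["D-Values"]
        .notna()
        .astype(int)
        .iloc[::-1]
        .groupby(df["ID"])   # the Series key aligns on the index
        .cummax()
        .iloc[::-1]
        .astype(bool))

df2 = df[mask]
```

Reversing with iloc[::-1] before the cummax makes the flag mean "a non-null value still lies at or below this row", so only the trailing null rows of each ID are dropped.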