Home > Software engineering >  How to groupby and drop rows that include "-inf" or "inf" and all the previous r
How to groupby and drop rows that include "-inf" or "inf" and all the previous r

Time:05-05

I have the following df, I want to groupby "group" and drop not only rows that include NaN, but all the preceding values.

group  date     value    
  1    jan2019    NaN
  1    jan2019    3
  1    jan2019    NaN
  1    feb2019    3        
  1    mar2019    4       
  1    mar2019    5        
  2    feb2019    0        
  2    feb2019    NaN       
  2    mar2019    7        
  2    mar2019    4        
  2    apr2019    5        

desired df

group  date     value    
  1    feb2019    3        
  1    mar2019    4       
  1    mar2019    5                  
  2    mar2019    7        
  2    mar2019    4        
  2    apr2019    5        

CodePudding user response:

You can flag the inf values using abs eq. Then to flag all the values before it as well, you can reverse the order of the Series and use cummax. Since you want to do this job across groups, you can use groupby.cummax. Finally, use the boolean mask to filter the desired output via loc:

out = df.loc[~df['value'].abs().eq(float('inf'))[::-1].groupby(df['group']).cummax()]

If the values to flag are NaNs (rather than inf), then we could use isna instead:

out = df.loc[~df['value'].isna()[::-1].groupby(df['group']).cummax()]

Output:

   group     date  value
2      1  feb2019    3.0
3      1  mar2019    4.0
4      1  mar2019    5.0
7      2  mar2019    7.0
8      2  mar2019    4.0
9      2  apr2019    5.0

CodePudding user response:

A more verbose but possibly more readable approach. .isin will work for NaN as well as infs.

def filter_preceding(df, search_values=[np.nan, np.inf, -np.inf]):        
    ind = np.where(df['values'].isin(search_values))[0]
    if len(ind) == 0:
        return df
    max_ind = ind.max()
    return df.iloc[max_ind   1:, :]

df.groupby('group').apply(filter_preceding).reset_index(drop=True)
  • Related