Get highest value in row, but only under specific condition from previous rows


I have this dataframe:

                        1     2     3
datetime                             
2021-01-23 00:01:00  10.0  11.0  15.0
2021-01-23 00:02:00  12.0  10.0   NaN
2021-01-23 00:03:00  16.0  12.0   NaN

(In reality, there are more columns like 1-3...)

I want to add two columns like this:

                        1     2     3  new_max  new_max_col
datetime                             
2021-01-23 00:01:00  10.0  11.0  15.0     15.0            3
2021-01-23 00:02:00  12.0  10.0   NaN      NaN          NaN
2021-01-23 00:03:00  16.0  12.0   NaN     16.0            1

Actions needed:

  • Column new_max: Copy the highest value in the row into this column. However, if the row contains a NaN, only do so if that highest value is higher than the last value before the NaN in that column (from previous rows); otherwise set new_max to NaN.

  • Column new_max_col: Set it to the name of the column from which the new_max value was taken.

My problem: If I simply used .fillna(method='ffill') before the comparison, I would get 15.0 in new_max for the second row, which would be wrong. Only "new" high values should be copied to new_max; forward-filled values should be used for comparison only.

Code to create the original dataframe:

import io, pandas as pd, numpy as np

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:01:00|10|11|15
2021-01-23 00:02:00|12|10|NaN
2021-01-23 00:03:00|16|12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime'], dtype=np.float64).set_index('datetime')

!!! EDIT !!!

Here is additional, different test data to illustrate the problem:

                        1     2     3
datetime                             
2021-01-23 00:00:00  20.0   NaN  15.0
2021-01-23 00:01:00  12.0   NaN   NaN
2021-01-23 00:02:00  10.0  17.0   NaN
2021-01-23 00:03:00   NaN  14.0  18.0
2021-01-23 00:04:00  16.0  12.0   NaN

Note that there could be more than 2 consecutive NaN in a column (not in this example).

Result should be:

                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:00:00  20.0   NaN  15.0    20.0           1
2021-01-23 00:01:00  12.0   NaN   NaN     NaN         NaN
2021-01-23 00:02:00  10.0  17.0   NaN    17.0           2
2021-01-23 00:03:00   NaN  14.0  18.0    18.0           3
2021-01-23 00:04:00  16.0  12.0   NaN     NaN         NaN

- Row 0: 20 is the highest.

- Row 1: Compare 12 and 15 (15 is lookback from first row in column "3"). 15 is the highest, but it is not "new" (it comes from looking back to the last number before NaNs in the column), so the result is NaN.

- Row 2: Compare 10, 17, 15 (15 is lookback). 17 is the highest.

- Row 3: Compare 10 (lookback), 14, 18. 18 is the highest.

- Row 4: Compare 16, 12, 18 (lookback). 18 is the highest but it's from lookback, so result is NaN.
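The rule walked through above can be made explicit with a naive row-by-row reference implementation. This is only a sketch for checking results (a Python loop, not the vectorized way to do it), but it spells out the "max only counts if an original value in the row reaches it" condition:

```python
import io

import numpy as np
import pandas as pd

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:00:00| 20|NaN| 15
2021-01-23 00:01:00| 12|NaN|NaN
2021-01-23 00:02:00| 10| 17|NaN
2021-01-23 00:03:00|NaN| 14| 18
2021-01-23 00:04:00| 16| 12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime']).set_index('datetime')

last_seen = {c: np.nan for c in df.columns}  # last non-NaN value per column
new_max, new_max_col = [], []
for _, row in df.iterrows():
    # candidates: original values, with NaNs replaced by the lookback value
    cand = {c: row[c] if pd.notna(row[c]) else last_seen[c] for c in df.columns}
    vals = [v for v in cand.values() if pd.notna(v)]
    best = max(vals) if vals else np.nan
    # the max only counts if some ORIGINAL value in this row reaches it
    orig = [c for c in df.columns if pd.notna(row[c]) and row[c] == best]
    new_max.append(best if orig else np.nan)
    new_max_col.append(orig[0] if orig else np.nan)
    for c in df.columns:
        if pd.notna(row[c]):
            last_seen[c] = row[c]

out = df.assign(new_max=new_max, new_max_col=new_max_col)
print(out)
```

This reproduces the expected result table above (20/'1', NaN, 17/'2', 18/'3', NaN) and can handle arbitrarily many consecutive NaNs per column, since last_seen always holds the most recent non-NaN value.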

Code to generate new testing dataframe:

import io, pandas as pd, numpy as np

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:00:00| 20|NaN| 15
2021-01-23 00:01:00| 12|NaN|NaN
2021-01-23 00:02:00| 10| 17|NaN
2021-01-23 00:03:00|NaN| 14| 18
2021-01-23 00:04:00| 16| 12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime'], dtype=np.float64).set_index('datetime')

CodePudding user response:

If the first row has no missing values, it is possible to compare with Series.cummax and set rows with missing values to NaN:

# row maximum and the column it comes from
df1 = (df.agg(['max','idxmax'], axis=1)
         .rename(columns={'idxmax':'max_col'})
         .add_prefix('new_'))

# mask rows whose max dropped below the running max and that contain a NaN
mask = df1['new_max'].cummax().ne(df1['new_max']) & df.isna().any(axis=1)

df1 = df.join(df1.mask(mask))
print(df1)
                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:01:00  10.0  11.0  15.0    15.0           3
2021-01-23 00:02:00  12.0  10.0   NaN     NaN         NaN
2021-01-23 00:03:00  16.0  12.0   NaN    16.0           1
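To see why this works on the first dataset, the intermediates can be printed. This sketch uses df.max/df.idxmax directly, which per row gives the same values as the agg call above:

```python
import io

import numpy as np
import pandas as pd

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:01:00|10|11|15
2021-01-23 00:02:00|12|10|NaN
2021-01-23 00:03:00|16|12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime']).set_index('datetime')

new_max = df.max(axis=1)          # 15.0, 12.0, 16.0
running = new_max.cummax()        # 15.0, 15.0, 16.0
has_nan = df.isna().any(axis=1)   # False, True, True
# masked: row max fell below the running max AND the row has a NaN
mask = running.ne(new_max) & has_nan
print(mask.tolist())              # [False, True, False]
```

Only the second row is masked: its max (12) is below the running max (15) and it contains a NaN, so its 12 could only "win" via a forward-filled value.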

EDIT:

#forward fill NaNs
df11 = df.ffill()
df1 = (df11.agg(['max','idxmax'],axis=1)
           .rename(columns={'idxmax':'max_col'})
           .add_prefix('new_'))

#compare only ffilled values against new_max
mask = df11.where(df.isna()).max(axis=1).eq(df1['new_max'])

df1 = df.join(df1.mask(mask))
print(df1)
                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:00:00  20.0   NaN  15.0    20.0           1
2021-01-23 00:01:00  12.0   NaN   NaN     NaN         NaN
2021-01-23 00:02:00  10.0  17.0   NaN    17.0           2
2021-01-23 00:03:00   NaN  14.0  18.0    18.0           3
2021-01-23 00:04:00  16.0  12.0   NaN     NaN         NaN
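The key intermediate here is `df11.where(df.isna())`, which keeps only the values that exist because of the forward fill. Its per-row maximum is the best "lookback" value; where that equals new_max, the row maximum came purely from a forward-filled value and is masked. A sketch on the second dataset:

```python
import io

import numpy as np
import pandas as pd

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:00:00| 20|NaN| 15
2021-01-23 00:01:00| 12|NaN|NaN
2021-01-23 00:02:00| 10| 17|NaN
2021-01-23 00:03:00|NaN| 14| 18
2021-01-23 00:04:00| 16| 12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime']).set_index('datetime')

df11 = df.ffill()
# keep only the values that exist because of the forward fill
lookback = df11.where(df.isna())
# per-row maximum of lookback-only values: NaN, 15.0, 15.0, 10.0, 18.0
print(lookback.max(axis=1))
```

Rows 1 and 4 are masked because their lookback maximum (15 and 18) equals the overall row maximum, i.e. no original value in those rows reaches the max.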

EDIT1: If the original max values also need to be compared, add another mask:

print (df)
                        1     2     3
datetime                             
2021-01-23 00:00:00  20.0   NaN  15.0
2021-01-23 00:01:00  12.0   NaN   NaN
2021-01-23 00:02:00  10.0  17.0   NaN
2021-01-23 00:03:00   NaN  14.0  18.0
2021-01-23 00:04:00  18.0  12.0   NaN <- changed value to 18

#forward fill NaNs
df11 = df.ffill()
df1 = (df11.agg(['max','idxmax'],axis=1)
           .rename(columns={'idxmax':'max_col'})
           .add_prefix('new_'))

#compare only ffill values by new_max
mask = df11.where(df.isna()).max(axis=1).eq(df1['new_max'])
mask1 = df1['new_max'].ne(df.max(axis=1))
df1 = df.join(df1.mask(mask & mask1))
print(df1)
                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:00:00  20.0   NaN  15.0    20.0           1
2021-01-23 00:01:00  12.0   NaN   NaN     NaN         NaN
2021-01-23 00:02:00  10.0  17.0   NaN    17.0           2
2021-01-23 00:03:00   NaN  14.0  18.0    18.0           3
2021-01-23 00:04:00  18.0  12.0   NaN    18.0           1
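The combined logic of the two masks can be checked on the changed data: a row is set to NaN only when its max equals the best lookback value (mask) AND no original value in the row reaches that max (mask1). A sketch, using the equivalent per-row max instead of the agg call:

```python
import io

import numpy as np
import pandas as pd

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:00:00| 20|NaN| 15
2021-01-23 00:01:00| 12|NaN|NaN
2021-01-23 00:02:00| 10| 17|NaN
2021-01-23 00:03:00|NaN| 14| 18
2021-01-23 00:04:00| 18| 12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime']).set_index('datetime')

df11 = df.ffill()
new_max = df11.max(axis=1)
# mask: the row max equals the best lookback value ...
mask = df11.where(df.isna()).max(axis=1).eq(new_max)
# mask1: ... and no ORIGINAL value in the row reaches it
mask1 = new_max.ne(df.max(axis=1))
print((mask & mask1).tolist())   # [False, True, False, False, False]
```

In the last row both the original 18 in column "1" and the forward-filled 18 in column "3" reach the max, so mask is True but mask1 is False and the row is kept, as in the output above.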