I have this dataframe:
                        1     2     3
datetime
2021-01-23 00:01:00  10.0  11.0  15.0
2021-01-23 00:02:00  12.0  10.0   NaN
2021-01-23 00:03:00  16.0  12.0   NaN
(In reality, there are more columns like 1-3...)
I want to add two columns like this:
                        1     2     3  new_max  new_max_col
datetime
2021-01-23 00:01:00  10.0  11.0  15.0     15.0            3
2021-01-23 00:02:00  12.0  10.0   NaN      NaN          NaN
2021-01-23 00:03:00  16.0  12.0   NaN     16.0            1
Actions needed:
- Column new_max: copy the highest value in the row into this column; if the row contains a NaN, do this only if that value is higher than the last value (from previous rows) before the NaN. Otherwise set new_max to NaN.
- Column new_max_col: set to the name of the column the new_max value was taken from.
My problem: if I simply used .fillna(method='ffill') before the comparison, I would get 15.0 in new_max for the second row, which would be wrong. Only "new" high values should be copied to new_max; the values from .fillna should be used for comparison only.
Code to create the original dataframe:
import io, pandas as pd, numpy as np
t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:01:00|10|11|15
2021-01-23 00:02:00|12|10|NaN
2021-01-23 00:03:00|16|12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime'], dtype=np.float64).set_index('datetime')
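For illustration, here is a minimal sketch of the naive forward-fill approach, using the df just created (the variable names filled and naive are only for this example). It reports 15.0 from column 3 as new_max in the second row, which is exactly what should not happen:
#naive attempt: forward fill, then take the row max - wrong for the second row
filled = df.ffill()
naive = df.copy()
naive['new_max'] = filled.max(axis=1)
naive['new_max_col'] = filled.idxmax(axis=1)
print(naive)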
!!! EDIT !!!
Here is more and different testing data to describe the problem:
                        1     2     3
datetime
2021-01-23 00:00:00  20.0   NaN  15.0
2021-01-23 00:01:00  12.0   NaN   NaN
2021-01-23 00:02:00  10.0  17.0   NaN
2021-01-23 00:03:00   NaN  14.0  18.0
2021-01-23 00:04:00  16.0  12.0   NaN
Note that there could be more than two consecutive NaNs in a column (not in this example).
Result should be:
                        1     2     3  new_max  new_max_col
datetime
2021-01-23 00:00:00  20.0   NaN  15.0     20.0            1
2021-01-23 00:01:00  12.0   NaN   NaN      NaN          NaN
2021-01-23 00:02:00  10.0  17.0   NaN     17.0            2
2021-01-23 00:03:00   NaN  14.0  18.0     18.0            3
2021-01-23 00:04:00  16.0  12.0   NaN      NaN          NaN
- Row 0: 20 is the highest.
- Row 1: Compare 12 and 15 (15 is lookback from first row in column "3"). 15 is the highest, but it is not "new" (it comes from looking back to the last number before NaNs in the column), so the result is NaN.
- Row 2: Compare 10, 17, 15 (15 is lookback). 17 is the highest.
- Row 3: Compare 10 (lookback), 14, 18. 18 is the highest.
- Row 4: Compare 16, 12, 18 (lookback). 18 is the highest but it's from lookback, so result is NaN.
Code to generate new testing dataframe:
import io, pandas as pd, numpy as np
t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:00:00| 20|NaN| 15
2021-01-23 00:01:00| 12|NaN|NaN
2021-01-23 00:02:00| 10| 17|NaN
2021-01-23 00:03:00|NaN| 14| 18
2021-01-23 00:04:00| 16| 12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime'], dtype=np.float64).set_index('datetime')
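For cross-checking the row-by-row explanation above, here is a minimal, non-vectorized sketch that applies the rule literally. It reuses df and the imports from the snippet above; the names filled, rows and check are only for this example.
filled = df.ffill()   #last known value per column, used as lookback
rows = []
for ts in df.index:
    best = filled.loc[ts].max()    #best value including lookback values
    actual = df.loc[ts].dropna()   #values actually present in this row
    if not actual.empty and actual.max() >= best:
        rows.append((actual.max(), actual.idxmax()))   #a "new" high
    else:
        rows.append((np.nan, np.nan))                  #best value is only a lookback
check = df.copy()
check['new_max'] = [v for v, c in rows]
check['new_max_col'] = [c for v, c in rows]
print(check)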
CodePudding user response:
If the first row has no missing values, it is possible to compare with Series.cummax and set the rows with missing values to NaNs:
#row max and the column it comes from
df1 = df.agg(['max','idxmax'], axis=1).rename(columns={'idxmax':'max_col'}).add_prefix('new_')
#mask rows whose max is not a new running high and which contain a NaN
mask = df1['new_max'].cummax().ne(df1['new_max']) & df.isna().any(axis=1)
df1 = df.join(df1.mask(mask))
print (df1)
                        1     2     3  new_max  new_max_col
datetime
2021-01-23 00:01:00  10.0  11.0  15.0     15.0            3
2021-01-23 00:02:00  12.0  10.0   NaN      NaN          NaN
2021-01-23 00:03:00  16.0  12.0   NaN     16.0            1
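To see what the mask is doing here (first test dataframe), you can print its building blocks separately: the row max fails to be a new running high only in the second row, and that row also contains a NaN, so only that row gets masked.
print (df.max(axis=1))           #plain row maximum
print (df.max(axis=1).cummax())  #running high so far
print (df.isna().any(axis=1))    #rows containing at least one NaN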
EDIT:
#forward fill NaNs
df11 = df.ffill()
df1 = (df11.agg(['max','idxmax'],axis=1)
.rename(columns={'idxmax':'max_col'})
.add_prefix('new_'))
#compare new_max only against the forward-filled values
mask = df11.where(df.isna()).max(axis=1).eq(df1['new_max'])
df1 = df.join(df1.mask(mask))
print (df1)
                        1     2     3  new_max  new_max_col
datetime
2021-01-23 00:00:00  20.0   NaN  15.0     20.0            1
2021-01-23 00:01:00  12.0   NaN   NaN      NaN          NaN
2021-01-23 00:02:00  10.0  17.0   NaN     17.0            2
2021-01-23 00:03:00   NaN  14.0  18.0     18.0            3
2021-01-23 00:04:00  16.0  12.0   NaN      NaN          NaN
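To see what this mask captures (second test dataframe, with df and df11 from above): df11.where(df.isna()) keeps the forward-filled values only at the positions that were NaN in the original data, so its row maximum equals new_max exactly when the winning value is a lookback value rather than a fresh one.
print (df11.where(df.isna()))              #forward-filled values at originally missing cells only
print (df11.where(df.isna()).max(axis=1))  #best lookback value per row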
EDIT1: If you also need to compare against the original (unfilled) max values, add another mask:
print (df)
                        1     2     3
datetime
2021-01-23 00:00:00  20.0   NaN  15.0
2021-01-23 00:01:00  12.0   NaN   NaN
2021-01-23 00:02:00  10.0  17.0   NaN
2021-01-23 00:03:00   NaN  14.0  18.0
2021-01-23 00:04:00  18.0  12.0   NaN   <- changed value to 18
#forward fill NaNs
df11 = df.ffill()
df1 = (df11.agg(['max','idxmax'],axis=1)
.rename(columns={'idxmax':'max_col'})
.add_prefix('new_'))
#compare new_max only against the forward-filled values
mask = df11.where(df.isna()).max(axis=1).eq(df1['new_max'])
mask1 = df1['new_max'].ne(df.max(axis=1))
df1 = df.join(df1.mask(mask & mask1))
print (df1)
                        1     2     3  new_max  new_max_col
datetime
2021-01-23 00:00:00  20.0   NaN  15.0     20.0            1
2021-01-23 00:01:00  12.0   NaN   NaN      NaN          NaN
2021-01-23 00:02:00  10.0  17.0   NaN     17.0            2
2021-01-23 00:03:00   NaN  14.0  18.0     18.0            3
2021-01-23 00:04:00  18.0  12.0   NaN     18.0            1
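As a quick check of the two masks on the modified data: mask flags rows whose winning value comes from a forward-filled cell, mask1 flags rows whose real (unfilled) row max falls short of the forward-filled max, and only rows flagged by both end up as NaN.
print (mask)          #winning value is a forward-filled (lookback) value
print (mask1)         #real row max is lower than the forward-filled max
print (mask & mask1)  #rows that get NaN in new_max / new_max_col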