Problem to solve:
- When column_A of a dataframe holds values that are not ordered, I would like to create a new column_B holding, for each row, the highest value met so far in column_A. I tried to use the rolling() method, but it gives different numbers from what I expected.
Reproducible example
A dataframe is filled with unordered numbers:
import pandas as pd
# create the list of unordered values
list_of_original_values = [100, 98, 102, 107, 94, 95, 96, 92, 150]
# create the dataframe
df = (
    pd.DataFrame()
    .assign(original_values=list_of_original_values)
)
# display the dataframe
df
original_values
0 100
1 98
2 102
3 107
4 94
5 95
6 96
7 92
8 150
Next, row after row, we look at the highest value of the first column in the current and previous rows, and we fill a new column with that running maximum.
Explanation:
- first row: the only value of the first column up to this row is 100, so the highest value is 100. We put 100 in the second column.
- second row: the values of the first column for this row and the previous rows are 100 and 98. The highest value is 100, so we put 100 in the second column.
- third row: the values of the first column for this row and the previous rows are 100, 98 and 102. The highest value is 102, so we put 102 in the second column.
- and so on ...
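The row-by-row rule above can be sketched in plain Python as a running maximum. This is only an illustration of the expected logic, assuming the standard library's itertools.accumulate; it is not meant as the pandas solution.

```python
from itertools import accumulate

list_of_original_values = [100, 98, 102, 107, 94, 95, 96, 92, 150]

# accumulate() with max keeps, at each position, the highest value
# seen in the current and all previous positions.
running_max = list(accumulate(list_of_original_values, max))

print(running_max)
```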
This should give:
# expected values should be this list:
list_of_expected_values = [100, 100, 102, 107, 107, 107, 107, 107, 150]
In the end, the expected dataframe should be this one:
# expected dataframe
df = (
    df.assign(expected_values=list_of_expected_values)
)
df
   original_values  expected_values
0              100              100
1               98              100
2              102              102
3              107              107
4               94              107
5               95              107
6               96              107
7               92              107
8              150              150
I've tried to use the pandas rolling() method, but the way I used it gives a result different from what was expected:
# trying to use rolling() and max()
df = df.assign(try_1_with_rolling=lambda df: df['original_values'].rolling(2).max())
# display the dataframe:
df
   original_values  expected_values  try_1_with_rolling
0              100              100                 NaN
1               98              100               100.0
2              102              102               102.0
3              107              107               107.0
4               94              107               107.0
5               95              107                95.0
6               96              107                96.0
7               92              107                96.0
8              150              150               150.0
Context of this problem: we receive data from sensors, but the timestamps are not ordered because the data arrives in batches. We'd like to detect those batches, i.e. the rows whose date is lower than the highest date met before.
CodePudding user response:
I believe what you are looking for is cummax():
df['expected_values'] = df['original_values'].cummax()
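As a rough end-to-end sketch of the above on the question's data: cummax() keeps the highest value seen so far, whereas rolling(2) only looks at the current row and the one before it, which is why it drops back to 95 on row 5. The column names running_max, expanding_max and out_of_order are made up for this illustration, and the out-of-order flag is just one possible way to spot the late rows described in the question.

```python
import pandas as pd

df = pd.DataFrame({"original_values": [100, 98, 102, 107, 94, 95, 96, 92, 150]})

# Running maximum over all rows seen so far (current row included).
df["running_max"] = df["original_values"].cummax()

# expanding() grows the window from the first row onward, so its max
# gives the same numbers as cummax() (as floats).
df["expanding_max"] = df["original_values"].expanding().max()

# Hypothetical flag: rows whose value is below the highest value met before.
df["out_of_order"] = df["original_values"] < df["running_max"]

print(df)
```

With this flag, rows 1, 4, 5, 6 and 7 are marked as arriving below a previously seen maximum.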