I'm not sure whether there is a more elegant way to do this. Basically, I need to determine the current row's "position" value based on its "factor" value and the previous row's "position" value.
I tried looping through the DataFrame and updating the values with if/else statements, but the code is clunky and the values didn't get updated.
Any help is appreciated, many thanks!
                       factor  position
time
2022-05-13 06:00:00  0.489471         0
2022-05-13 07:00:00  0.711030         0
2022-05-13 08:00:00  0.566865         0
2022-05-13 09:00:00  0.489471         0
2022-05-13 10:00:00  0.288419         0
import pandas as pd
df = pd.DataFrame({'time': ['2022-05-13 06:00:00', '2022-05-13 07:00:00', '2022-05-13 08:00:00','2022-05-13 09:00:00', '2022-05-13 10:00:00'],
'factor': [0.489471, 0.711030, 0.566865, 0.489471, 0.288419],
'position': [0, 0, 0, 0, 0]})
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
threshold_2 = 0.7
threshold_1 = 0.35
for i in range(0, len(df)):
    # no position
    if i == 0 or df.iloc[i-1, :]['position'] == 0:
        if df.iloc[i, :]['factor'] > threshold_2:
            df.iloc[i, :]['position'] = 1
        else:
            df.iloc[i, :]['position'] = 0
    # has position
    elif df.iloc[i-1, :]['position'] != 0:
        if df.iloc[i, :]['factor'] > threshold_1:
            df.iloc[i, :]['position'] = 1
        else:
            df.iloc[i, :]['position'] = 0
CodePudding user response:
IIUC, you can use this vectorized alternative:
threshold_2 = 0.7
threshold_1 = 0.35
m1 = df['factor'].gt(threshold_2)
group = m1.cumsum()
m2 = df.loc[group>0, 'factor'].gt(threshold_1).groupby(group).cummin()
df['position'] = (m1|df.index.isin(m2[m2].index)).astype(int)
output:
                       factor  position
time
2022-05-13 06:00:00  0.489471         0
2022-05-13 07:00:00  0.711030         1
2022-05-13 08:00:00  0.566865         1
2022-05-13 09:00:00  0.489471         1
2022-05-13 10:00:00  0.288419         0
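For comparison, the same rule can also be written as a small state machine over the raw values, which may be easier to read than the mask-based version. This is only a sketch: the function name hysteresis_positions and its parameter names are made up for illustration, and it assumes the rule is "enter when factor > threshold_2, stay while factor > threshold_1".

```python
import pandas as pd

def hysteresis_positions(factors, enter_threshold, hold_threshold):
    """Return a list of 0/1 positions (hypothetical helper, not a pandas API)."""
    positions = []
    in_position = False  # assumption: we start with no position
    for value in factors:
        # The bar to clear depends on whether we currently hold a position.
        limit = hold_threshold if in_position else enter_threshold
        in_position = value > limit
        positions.append(int(in_position))
    return positions

df = pd.DataFrame({
    'time': pd.to_datetime(['2022-05-13 06:00:00', '2022-05-13 07:00:00',
                            '2022-05-13 08:00:00', '2022-05-13 09:00:00',
                            '2022-05-13 10:00:00']),
    'factor': [0.489471, 0.711030, 0.566865, 0.489471, 0.288419],
}).set_index('time')

threshold_2 = 0.7   # threshold to open a position
threshold_1 = 0.35  # threshold to keep an open position

# Iterate over the plain values once and assign the whole column in one go.
df['position'] = hysteresis_positions(df['factor'].to_numpy(), threshold_2, threshold_1)
```

Because the state is carried in a local variable and the column is assigned once at the end, this never writes to the DataFrame row by row.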
CodePudding user response:
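When you use chained indexing, the order and type of the indexing operation partially determine whether the result is a slice into the original object, or a copy of the slice. In your loop, df.iloc[i, :]['position'] = ... assigns to a temporary copy of the row, so df itself is never updated.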
Read more about it here
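A small sketch of the difference, using a two-row frame made up for illustration (the variable names chained_result and direct_result are not from the question). The warning filter is only there because, depending on the pandas version, the chained assignment may emit a warning.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'factor': [0.49, 0.71], 'position': [0, 0]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Chained indexing: df.iloc[1, :] builds a new Series for the row
    # (the columns have mixed dtypes), so the assignment hits that copy.
    df.iloc[1, :]['position'] = 1

chained_result = int(df.loc[df.index[1], 'position'])  # df was not modified

# A single .loc call with both row and column labels writes into df itself.
df.loc[df.index[1], 'position'] = 1
direct_result = int(df.loc[df.index[1], 'position'])
```

Here chained_result stays at 0 while direct_result is 1.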
I made a few small changes to your code so that it works. Assignments now go through df.loc with a row label and a column label, and you don't need the inner if/else: you can just write 1 if foo else 0.
df = pd.DataFrame({'time': ['2022-05-13 06:00:00', '2022-05-13 07:00:00', '2022-05-13 08:00:00','2022-05-13 09:00:00', '2022-05-13 10:00:00'],
'factor': [0.489471, 0.711030, 0.566865, 0.489471, 0.288419],
'position': [0, 0, 0, 0, 0]})
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
threshold_2 = 0.7
threshold_1 = 0.35
for i in range(0, len(df)):
    # no position
    if i == 0 or df.loc[df.index[i-1], 'position'] == 0:
        df.loc[df.index[i], 'position'] = 1 if df.loc[df.index[i], 'factor'] > threshold_2 else 0
    # has position
    elif df.loc[df.index[i-1], 'position'] != 0:
        df.loc[df.index[i], 'position'] = 1 if df.loc[df.index[i], 'factor'] > threshold_1 else 0
print(df)
                       factor  position
time
2022-05-13 06:00:00  0.489471         0
2022-05-13 07:00:00  0.711030         1
2022-05-13 08:00:00  0.566865         1
2022-05-13 09:00:00  0.489471         1
2022-05-13 10:00:00  0.288419         0