I have a set of data where I want to do, If the latest st_1 or st_2 are greater than earlier st_1 or st_2 put True or False respectively in another column. How can I do that on the basis of date and id?
id date st_1 st_2
1 2022-02-28 00:00:00 00:00 60.0 6.0
2 2021-10-31 00:00:00 00:00 70.0 0.0
2 2021-12-31 00:00:00 00:00 70.0 4.0
3 2021-10-31 00:00:00 00:00 60.0 0.0
4 2021-06-30 00:00:00 00:00 63.3 2.66
4 2021-08-31 00:00:00 00:00 60.0 3.0
4 2022-02-28 00:00:00 00:00 70.0 2.0
5 2021-06-30 00:00:00 00:00 70.0 3.0
4 2022-02-28 00:00:00 00:00 70.0 2.0
5 2021-06-30 00:00:00 00:00 70.0 3.0
5 2021-08-31 00:00:00 00:00 80.0 2.0
5 2021-10-31 00:00:00 00:00 70.0 3.5
CodePudding user response:
Update: I just noticed the the spec is "If the latest st_1 or st_2 are greater than earlier st_1 or st_2" which means that the correct answer is to use "|" instead of the original answer's "&". Corrected. Here is the solution:
import io
import pandas as pd
string = """id date st_1 st_2
1 "2022-02-28 00:00:00 00:00" 60.0 6.0
2 "2021-10-31 00:00:00 00:00" 70.0 0.0
2 "2021-12-31 00:00:00 00:00" 70.0 4.0
3 "2021-10-31 00:00:00 00:00" 60.0 0.0
4 "2021-06-30 00:00:00 00:00" 63.3 2.66
4 "2021-08-31 00:00:00 00:00" 60.0 3.0
4 "2022-02-28 00:00:00 00:00" 70.0 2.0
5 "2021-06-30 00:00:00 00:00" 70.0 3.0
4 "2022-02-28 00:00:00 00:00" 70.0 2.0
5 "2021-06-30 00:00:00 00:00" 70.0 3.0
5 "2021-08-31 00:00:00 00:00" 80.0 2.0
5 "2021-10-31 00:00:00 00:00" 70.0 3.5
"""
data = io.StringIO(string)
df0 = pd.read_csv(data, sep="\s ") # Load df0 from the data string
df = df0.sort_values(['date', 'id']) # Sort according to the spec
df['lag_st_1'] = df['st_1'].shift(1) # Create column lag_st_1 with the st_1 data lagged by 1 row
df['lag_st_2'] = df['st_2'].shift(1) # Ditto for st_2
# Create result column with True values where the right conditions are met
df.loc[((df['st_1'] > df['lag_st_1']) | (df['st_2'] > df['lag_st_2'])), 'result'] = True
# The previous operation fills the rest of the rows with NAs.
# Here we change the NAs to "False"
df['result'] = df['result'].fillna(False)
print(df)
Updated output:
id date st_1 st_2 lag_st_1 lag_st_2 result
4 4 2021-06-30 00:00:00 00:00 63.3 2.66 NaN NaN False
7 5 2021-06-30 00:00:00 00:00 70.0 3.00 63.3 2.66 True
9 5 2021-06-30 00:00:00 00:00 70.0 3.00 70.0 3.00 False
5 4 2021-08-31 00:00:00 00:00 60.0 3.00 70.0 3.00 False
10 5 2021-08-31 00:00:00 00:00 80.0 2.00 60.0 3.00 True
1 2 2021-10-31 00:00:00 00:00 70.0 0.00 80.0 2.00 False
3 3 2021-10-31 00:00:00 00:00 60.0 0.00 70.0 0.00 False
11 5 2021-10-31 00:00:00 00:00 70.0 3.50 60.0 0.00 True
2 2 2021-12-31 00:00:00 00:00 70.0 4.00 70.0 3.50 True
0 1 2022-02-28 00:00:00 00:00 60.0 6.00 70.0 4.00 True
6 4 2022-02-28 00:00:00 00:00 70.0 2.00 60.0 6.00 True
8 4 2022-02-28 00:00:00 00:00 70.0 2.00 70.0 2.00 False
CodePudding user response:
IIUC, you want to check the conditions with the dates, but not have one date tested against a previous date. The logic in C Pappy's answer is better than this but this does the checks only within the date group, therefore it results in fewer 'True'. Please let us know which is correct.
df.sort_values(['date', 'id'], inplace=True)
df['st_1_check'] = False
df['st_2_check'] = False
def test_conditions(x):
if x.shape[0] > 1:
x.loc[:, 'st_1_check'] = x['st_1'] - x['st_1'].shift(1)
x.loc[:, 'st_2_check'] = x['st_2'] - x['st_2'].shift(1)
return x
dfnew = df.groupby(['date']).apply(test_conditions)
dfnew.fillna(False, inplace=True)
dfnew['st_1_check'] = np.where(dfnew['st_1_check'] > 0, True, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] > 0, True, dfnew['st_2_check'])
dfnew['st_1_check'] = np.where(dfnew['st_1_check'] <= 0, False, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] <= 0, False, dfnew['st_2_check'])
dfnew
id date st_1 st_2 st_1_check st_2_check
4 4 2021-06-30 00:00:00 00:00 63.30000 2.66000 False False
7 5 2021-06-30 00:00:00 00:00 70.00000 3.00000 True True
9 5 2021-06-30 00:00:00 00:00 70.00000 3.00000 False False
5 4 2021-08-31 00:00:00 00:00 60.00000 3.00000 False False
10 5 2021-08-31 00:00:00 00:00 80.00000 2.00000 True False
1 2 2021-10-31 00:00:00 00:00 70.00000 0.00000 False False
3 3 2021-10-31 00:00:00 00:00 60.00000 0.00000 False False
11 5 2021-10-31 00:00:00 00:00 70.00000 3.50000 True True
2 2 2021-12-31 00:00:00 00:00 70.00000 4.00000 False False
0 1 2022-02-28 00:00:00 00:00 60.00000 6.00000 False False
6 4 2022-02-28 00:00:00 00:00 70.00000 2.00000 True False
8 4 2022-02-28 00:00:00 00:00 70.00000 2.00000 False False