Home > OS >  Python Data Grouping and compare on the basis of date
Python Data Grouping and compare on the basis of date

Time:02-20

I have a set of data where I want to do, If the latest st_1 or st_2 are greater than earlier st_1 or st_2 put True or False respectively in another column. How can I do that on the basis of date and id?

id  date                        st_1    st_2
1   2022-02-28 00:00:00 00:00   60.0    6.0
2   2021-10-31 00:00:00 00:00   70.0    0.0
2   2021-12-31 00:00:00 00:00   70.0    4.0
3   2021-10-31 00:00:00 00:00   60.0    0.0
4   2021-06-30 00:00:00 00:00   63.3    2.66
4   2021-08-31 00:00:00 00:00   60.0    3.0
4   2022-02-28 00:00:00 00:00   70.0    2.0
5   2021-06-30 00:00:00 00:00   70.0    3.0
4   2022-02-28 00:00:00 00:00   70.0    2.0
5   2021-06-30 00:00:00 00:00   70.0    3.0
5   2021-08-31 00:00:00 00:00   80.0    2.0
5   2021-10-31 00:00:00 00:00   70.0    3.5

CodePudding user response:

Update: I just noticed the the spec is "If the latest st_1 or st_2 are greater than earlier st_1 or st_2" which means that the correct answer is to use "|" instead of the original answer's "&". Corrected. Here is the solution:

import io
import pandas as pd
string = """id  date st_1 st_2
1  "2022-02-28 00:00:00 00:00"  60.0    6.0
2  "2021-10-31 00:00:00 00:00"  70.0    0.0
2  "2021-12-31 00:00:00 00:00"  70.0    4.0
3  "2021-10-31 00:00:00 00:00"  60.0    0.0
4  "2021-06-30 00:00:00 00:00"  63.3    2.66
4  "2021-08-31 00:00:00 00:00"  60.0    3.0
4  "2022-02-28 00:00:00 00:00"  70.0    2.0
5  "2021-06-30 00:00:00 00:00"  70.0    3.0
4  "2022-02-28 00:00:00 00:00"  70.0    2.0
5  "2021-06-30 00:00:00 00:00"  70.0    3.0
5  "2021-08-31 00:00:00 00:00"  80.0    2.0
5  "2021-10-31 00:00:00 00:00"  70.0    3.5
"""
data = io.StringIO(string)
df0 = pd.read_csv(data, sep="\s ")  # Load df0 from the data string
df = df0.sort_values(['date', 'id'])  # Sort according to the spec
df['lag_st_1'] = df['st_1'].shift(1)  # Create column lag_st_1 with the st_1 data lagged by 1 row
df['lag_st_2'] = df['st_2'].shift(1)  # Ditto for st_2
# Create result column with True values where the right conditions are met
df.loc[((df['st_1'] > df['lag_st_1']) | (df['st_2'] > df['lag_st_2'])), 'result'] = True
# The previous operation fills the rest of the rows with NAs.
# Here we change the NAs to "False"
df['result'] = df['result'].fillna(False)
print(df)

Updated output:

    id                       date  st_1  st_2  lag_st_1  lag_st_2  result
4    4  2021-06-30 00:00:00 00:00  63.3  2.66       NaN       NaN   False
7    5  2021-06-30 00:00:00 00:00  70.0  3.00      63.3      2.66    True
9    5  2021-06-30 00:00:00 00:00  70.0  3.00      70.0      3.00   False
5    4  2021-08-31 00:00:00 00:00  60.0  3.00      70.0      3.00   False
10   5  2021-08-31 00:00:00 00:00  80.0  2.00      60.0      3.00    True
1    2  2021-10-31 00:00:00 00:00  70.0  0.00      80.0      2.00   False
3    3  2021-10-31 00:00:00 00:00  60.0  0.00      70.0      0.00   False
11   5  2021-10-31 00:00:00 00:00  70.0  3.50      60.0      0.00    True
2    2  2021-12-31 00:00:00 00:00  70.0  4.00      70.0      3.50    True
0    1  2022-02-28 00:00:00 00:00  60.0  6.00      70.0      4.00    True
6    4  2022-02-28 00:00:00 00:00  70.0  2.00      60.0      6.00    True
8    4  2022-02-28 00:00:00 00:00  70.0  2.00      70.0      2.00   False

CodePudding user response:

IIUC, you want to check the conditions with the dates, but not have one date tested against a previous date. The logic in C Pappy's answer is better than this but this does the checks only within the date group, therefore it results in fewer 'True'. Please let us know which is correct.

df.sort_values(['date', 'id'], inplace=True)
df['st_1_check'] = False
df['st_2_check'] = False

def test_conditions(x):
    if x.shape[0] > 1:
        x.loc[:, 'st_1_check'] = x['st_1'] - x['st_1'].shift(1)
        x.loc[:, 'st_2_check'] = x['st_2'] - x['st_2'].shift(1)
    return x

dfnew = df.groupby(['date']).apply(test_conditions)
dfnew.fillna(False, inplace=True)
dfnew['st_1_check'] = np.where(dfnew['st_1_check'] > 0, True, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] > 0, True, dfnew['st_2_check'])

dfnew['st_1_check'] = np.where(dfnew['st_1_check'] <= 0, False, dfnew['st_1_check'])
dfnew['st_2_check'] = np.where(dfnew['st_2_check'] <= 0, False, dfnew['st_2_check'])
dfnew

    id                       date     st_1    st_2 st_1_check st_2_check
4    4  2021-06-30 00:00:00 00:00 63.30000 2.66000      False      False
7    5  2021-06-30 00:00:00 00:00 70.00000 3.00000       True       True
9    5  2021-06-30 00:00:00 00:00 70.00000 3.00000      False      False
5    4  2021-08-31 00:00:00 00:00 60.00000 3.00000      False      False
10   5  2021-08-31 00:00:00 00:00 80.00000 2.00000       True      False
1    2  2021-10-31 00:00:00 00:00 70.00000 0.00000      False      False
3    3  2021-10-31 00:00:00 00:00 60.00000 0.00000      False      False
11   5  2021-10-31 00:00:00 00:00 70.00000 3.50000       True       True
2    2  2021-12-31 00:00:00 00:00 70.00000 4.00000      False      False
0    1  2022-02-28 00:00:00 00:00 60.00000 6.00000      False      False
6    4  2022-02-28 00:00:00 00:00 70.00000 2.00000       True      False
8    4  2022-02-28 00:00:00 00:00 70.00000 2.00000      False      False
  • Related