Calculating change in score and adding to a new column pandas-CodePudding

I am looking to calculate the changed in mental health scores of each individual between two timepoints.

Each user has a name, and a mental health score from 3 different timepoints. I would like to calculate the change in mental health score between timepoint 3 and 1

Below is example of df I'm starting with:

User   Timepoint   Mental Health Score
Bill       1               5
Bill       2               10
Bill       3               15
Wiz        1               10
Wiz        2               10
Wiz        3               15
Sam        1               5
Sam        2               5
Sam        3               5

This is desired output:

User   Timepoint   Mental Health Score      Change in Mental Health (TP1 and 3) 
Bill       1               5                               
Bill       2               10
Bill       3               15                              10
Wiz        1               10
Wiz        2               10
Wiz        3               15                              5
Sam        1               5
Sam        2               5
Sam        3               5                               0

Does anyone know how to do this?

CodePudding user response：

You can accomplish this using shift() and np.where()

df['Change in Mental Health (TP1 and 3)'] = df['Mental Health Score'] - df['Mental Health Score'].shift(2)
df['Change in Mental Health (TP1 and 3)'] = np.where(df['Timepoint'] != 3, 0, df['Change in Mental Health (TP1 and 3)']).astype(int)
df

CodePudding user response：

Try with groupby and where:

#sort by Timepoint if needed
#df = df.sort_values("Timepoint")

changes = df.groupby("User")["Mental Health Score"].transform('last')-df.groupby("User")["Mental Health Score"].transform('first')
df["Change"] = changes.where(df["Timepoint"].eq(3))

>>> df
   User  Timepoint  Mental Health Score  Change
0  Bill          1                    5     NaN
1  Bill          2                   10     NaN
2  Bill          3                   15    10.0
3   Wiz          1                   10     NaN
4   Wiz          2                   10     NaN
5   Wiz          3                   15     5.0
6   Sam          1                    5     NaN
7   Sam          2                    5     NaN
8   Sam          3                    5     0.0

CodePudding user response：

As it is stated already in the comments you can groupby your dataframe on User and calculate difference on Mental Health Score

I put a snippet code here to demonstrate

def _overall_change(scores):
  return scores.iloc[-1] - scores.iloc[0]

person = df.groupby('User')['Score'].agg(_overall_change)

CodePudding user response：

Using groupby and a merge:

g = df.sort_values(by='Timepoint').groupby('User')['Mental Health Score']

s = pd.concat({3: g.last()-g.first()})
#    User
# 3  Bill    10
#    Sam      0
#    Wiz      5
# Name: Mental Health Score, dtype: int64

df.merge(s, left_on=['Timepoint', 'User'], right_index=True, how='left')

output:

   User  Timepoint  Mental Health Score_x  Mental Health Score_y
0  Bill          1                      5                    NaN
1  Bill          2                     10                    NaN
2  Bill          3                     15                   10.0
3   Wiz          1                     10                    NaN
4   Wiz          2                     10                    NaN
5   Wiz          3                     15                    5.0
6   Sam          1                      5                    NaN
7   Sam          2                      5                    NaN
8   Sam          3                      5                    0.0

CodePudding user response：

Here's another possible solution:

import pandas as pd

def calculate_change(mhs):
    mhs = list(mhs)
    return mhs[-1] - mhs[0]

df = df.sort_values(["User", "Timepoint"])
diff = df.groupby('User')['Mental Health Score'].agg(calculate_change)

df = pd.merge(df, diff, how='left', left_on='User', right_index=True)
df.columns = ['User', 'Timepoint', 'Mental Health Score', 'Change']
df['Change'] = df['Change'].loc[df['Timepoint']==3]
print(df)

Output

   User  Timepoint  Mental Health Score  Change
0  Bill          1                    5     NaN
1  Bill          2                   10     NaN
2  Bill          3                   15    10.0
3   Wiz          1                   10     NaN
4   Wiz          2                   10     NaN
5   Wiz          3                   15     5.0
6   Sam          1                    5     NaN
7   Sam          2                    5     NaN
8   Sam          3                    5     0.0