Create New Column Taking Slope of Values-CodePudding

I have the following dataframe of students with their exam scores in different dates:

df = pd.DataFrame({'student': 'A A A B B B B'.split(),
                  'exam_date': pd.date_range(start='1/1/2020', periods=7, freq='M'),
                  'score': [15, 28, 17, 22, 43, 40, 52]})

print(df)

  student  exam_date  score
0       A 2020-01-31     15
1       A 2020-02-29     28
2       A 2020-03-31     17
3       B 2020-04-30     22
4       B 2020-05-31     43
5       B 2020-06-30     40
6       B 2020-07-31     52

I need to create a new column -- the slope of scores of each student.

I'm trying the following script:

df.exam_date = pd.to_datetime(df.exam_date)

df['date_ordinal'] = pd.to_datetime(df['exam_date']).map(dt.datetime.toordinal)

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(df['date_ordinal'], df['score'])

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

Desired output:

  student  exam_date  score     slope
0       A 2020-01-31     15     0.028
1       A 2020-02-29     28     0.028
2       A 2020-03-31     17     0.028
3       B 2020-04-30     22     0.285
4       B 2020-05-31     43     0.285
5       B 2020-06-30     40     0.285
6       B 2020-07-31     52     0.285

CodePudding user response：

You can use a groupby / apply pattern to compute the slope for each student in one call, and then use merge or join to attach back onto your original dataframe.

Step 1: Define function to compute slope for each student You have essentially already done this in your question:

import pandas as pd
from scipy import stats

def get_slope(df_student):
    """
    Assumes df_student has the columns 'score' and 'date_ordinal' available
    """
    results = stats.linregress(df_student['date_ordinal'], df_student['score'])
    return results.slope

Step 2: Compute the slope for each student with groupby

slopes = df.groupby('student').apply(get_slope).rename('slope')

slopes object is a series indexed by student:

Step 3: Join back to original dataframe

Here we can either use join or merge. The main difference between join and merge is that merge is more flexible with the ability to join on both row indexes and columns, while join is more concise and designed to join in row indexes only. To use join here, the index of the original dataframe would need to change to student. So instead, use the merge method:

df_final = df.merge(slopes, left_on=['student'], right_index=True).drop('date_ordinal', axis=1)

Now df_final should be your desired output:

Notes:

You can achieve the same result more concisely if you are willing to use student as the index of your original dataframe as follows:

df = pd.DataFrame({'student': 'A A A B B B B'.split(),
                  'exam_date': pd.date_range(start='1/1/2020', periods=7, freq='M'),
                  'score': [15, 28, 17, 22, 43, 40, 52]})
df['date_ordinal'] = pd.to_datetime(df['exam_date']).map(dt.datetime.toordinal)

df.set_index(['student'], inplace=True, append=False)
df['slope'] = df.groupby('student')\
    .apply(lambda x: stats.linregress(x['date_ordinal'], x['score']).slope)\
    .drop('date_ordinal', axis=1)