Home > Back-end >  Calculate the slope for every n days per group with respect to row with condition
Calculate the slope for every n days per group with respect to row with condition

Time:10-15

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01', False, 2], ['A', '2022-09-02', False, 1], ['A', '2022-09-03', False, 1], ['A', '2022-09-04', True, 3], 
        ['A', '2022-09-05', False, 3], ['A', '2022-09-06', False, 2], ['A', '2022-09-07', False, 1], ['A', '2022-09-07', False, 2], 
        ['A', '2022-09-08', False, 4], ['A', '2022-09-09', False, 2],
        ['B', '2022-09-01', False, 2], ['B', '2022-09-02', False, 2], ['B', '2022-09-03', False, 4], ['B', '2022-09-04', False, 2], 
        ['B', '2022-09-05', True, 2], ['B', '2022-09-06', False, 2], ['B', '2022-09-07', False, 1], ['B', '2022-09-08', False, 3], 
        ['B', '2022-09-09', False, 3], ['B', '2022-09-10', False, 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

# Add diff_days which is difference in days with closest row with True condition per group
df['date'] = pd.to_datetime(df['date'])

df = (
    pd.merge_asof(df.sort_values('date'), 
                  df.loc[df['indicator'], ['group','date']].sort_values('date')
                    .assign(diff_days=lambda x: x['date']), 
                  by='group', on='date', direction='nearest')
      .assign(diff_days=lambda x: (x['date']-x['diff_days']).dt.days)
      .sort_values(['group','date'])
      .reset_index(drop=True)
)

   group       date  indicator  value  diff_days
0      A 2022-09-01      False      2         -3
1      A 2022-09-02      False      1         -2
2      A 2022-09-03      False      1         -1
3      A 2022-09-04       True      3          0
4      A 2022-09-05      False      3          1
5      A 2022-09-06      False      2          2
6      A 2022-09-07      False      2          3
7      A 2022-09-07      False      1          3
8      A 2022-09-08      False      4          4
9      A 2022-09-09      False      2          5
10     B 2022-09-01      False      2         -4
11     B 2022-09-02      False      2         -3
12     B 2022-09-03      False      4         -2
13     B 2022-09-04      False      2         -1
14     B 2022-09-05       True      2          0
15     B 2022-09-06      False      2          1
16     B 2022-09-07      False      1          2
17     B 2022-09-08      False      3          3
18     B 2022-09-09      False      3          4
19     B 2022-09-10      False      2          5

I would like to add a column called "slope" which returns the slope for n days (this case n = 3) with respect to the rows with condition "indicator = True" per group. Here is the desired output:

data = [['A', '2022-09-01', False, 2, -3, -0.5], ['A', '2022-09-02', False, 1, -2, -0.5], ['A', '2022-09-03', False, 1, -1, -0.5], ['A', '2022-09-04', True, 3, 0, 0], 
        ['A', '2022-09-05', False, 3, 1, -0.5], ['A', '2022-09-06', False, 2, 2, -0.5], ['A', '2022-09-07', False, 2, 3, -0.5], ['A', '2022-09-07', False, 1, 3, 0.5], 
        ['A', '2022-09-08', False, 4, 4, 0.5], ['A', '2022-09-09', False, 2, 5, 0.5],
        ['B', '2022-09-01', False, 2, -4], ['B', '2022-09-02', False, 2, -3, 0], ['B', '2022-09-03', False, 4, -2, 0], ['B', '2022-09-04', False, 2, -1, 0], 
        ['B', '2022-09-05', True, 2, 0, 0], ['B', '2022-09-06', False, 2, 1, 0.5], ['B', '2022-09-07', False, 1, 2, 0.5], ['B', '2022-09-08', False, 3, 3, 0.5], 
        ['B', '2022-09-09', False, 3, 4, -1], ['B', '2022-09-10', False, 2, 5, -1]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days', 'slope'])

   group        date  indicator  value  diff_days  slope
0      A  2022-09-01      False      2         -3   -0.5
1      A  2022-09-02      False      1         -2   -0.5
2      A  2022-09-03      False      1         -1   -0.5
3      A  2022-09-04       True      3          0    0.0
4      A  2022-09-05      False      3          1   -0.5
5      A  2022-09-06      False      2          2   -0.5
6      A  2022-09-07      False      2          3   -0.5
7      A  2022-09-07      False      1          3    0.5
8      A  2022-09-08      False      4          4    0.5
9      A  2022-09-09      False      2          5    0.5
10     B  2022-09-01      False      2         -4    NaN
11     B  2022-09-02      False      2         -3    0.0
12     B  2022-09-03      False      4         -2    0.0
13     B  2022-09-04      False      2         -1    0.0
14     B  2022-09-05       True      2          0    0.0
15     B  2022-09-06      False      2          1    0.5
16     B  2022-09-07      False      1          2    0.5
17     B  2022-09-08      False      3          3    0.5
18     B  2022-09-09      False      3          4   -1.0
19     B  2022-09-10      False      2          5   -1.0

Let's explain the calculations of group B. The slope (using "diff_days" as x values) are calculated for n = 3 with respect to the row with indicator == True, which is row 15 in dataframe:

  • For rows 12,13,14 the slope is: linregress([-3,-2,-1],[2,4,2])=0
  • Row 11 is alone because it doesn't fit in the 3 days range with respect to the certain row (indicator==True) which means: linregress([-4],[2])=NaN
  • For rows 16,17,18 the slope is: linregress([1,2,3],[2,1,3])=0.5
  • For rows 19,20 the slope is: linregress([4,5],[3,2])=-1.0

Please note: the rows with condition (indicator==True) should have a slope value of 0.

So, I was wondering if anyone knows how to calculate the slopes of n days with respect to a certain row per group using pandas?

CodePudding user response:

I'm a bit confused by your example - is there a mistake in 10 (what you call Row 11?) If I understand correctly, what you want is to count each group as new after either a) indicator becomes True or b) every 3 rows. This can be done like this:

from scipy.stats import linregress

def count_every_n(grp, n):
    return pd.Series([k // n for k in range(len(grp))])

def get_slope(grp):
    return pd.Series(linregress(grp.diff_days, grp.value).slope, index=grp.index)

indicator_change = (df.indicator != df.indicator.shift()).cumsum()

every_n_within_groups = (df
                         .groupby(indicator_change, group_keys=False)
                         .apply(lambda g: count_every_n(g, n=3))
                         .reset_index(drop=True))

df['slope'] = (df
               .groupby([indicator_change, every_n_within_groups])
               .apply(get_slope)
               .reset_index(drop=True)
               .fillna(0))
  • Related