Home > Software engineering >  Calculate the slope for every n days per group
Calculate the slope for every n days per group

Time:10-12

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'value'])

df['date'] = pd.to_datetime(df['date'])
df['diff_days'] = (df['date']-df['date'].groupby(df['group']).transform('first')).dt.days

   group       date  value  diff_days
0      A 2022-09-01      2          0
1      A 2022-09-02      1          1
2      A 2022-09-04      3          3
3      A 2022-09-06      2          5
4      A 2022-09-07      1          6
5      A 2022-09-07      2          6
6      A 2022-09-08      4          7
7      A 2022-09-09      2          8
8      B 2022-09-01      2          0
9      B 2022-09-03      4          2
10     B 2022-09-04      2          3
11     B 2022-09-05      2          4
12     B 2022-09-07      1          6
13     B 2022-09-08      3          7
14     B 2022-09-10      2          9

I would like to create a column called "slope" which shows the slope for every n (n = 3) days per group. This means that when the first date is "2022-09-01" and 3 days later are used for the calculation. The slope can be calculated using the "diff_days" (calculated by difference with the first value per group) and "value" columns. Here is the desired output:

data = [['A', '2022-09-01', 2, 0, 0.43], ['A', '2022-09-02', 1, 1, 0.43], ['A', '2022-09-04', 3, 3, 0.43], ['A', '2022-09-06', 2, 5, -0.5],
        ['A', '2022-09-07', 1, 6, -0.5], ['A', '2022-09-07', 2, 6, -0.5], ['A', '2022-09-08', 4, 7, -2], ['A', '2022-09-09', 2, 8, -2],
        ['B', '2022-09-01', 2, 0, 0.14], ['B', '2022-09-03', 4, 2, 0.14], ['B', '2022-09-04', 2, 3, 0.14], ['B', '2022-09-05', 2, 4, -0.5],
        ['B', '2022-09-07', 1, 6, -0.5], ['B', '2022-09-08', 3, 7, -0.5], ['B', '2022-09-10', 2, 9, -0.5]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'value', 'diff_days', 'slope'])

   group        date  value  diff_days  slope
0      A  2022-09-01      2          0   0.43
1      A  2022-09-02      1          1   0.43
2      A  2022-09-04      3          3   0.43
3      A  2022-09-06      2          5  -0.50
4      A  2022-09-07      1          6  -0.50
5      A  2022-09-07      2          6  -0.50
6      A  2022-09-08      4          7  -2.00
7      A  2022-09-09      2          8  -2.00
8      B  2022-09-01      2          0   0.14
9      B  2022-09-03      4          2   0.14
10     B  2022-09-04      2          3   0.14
11     B  2022-09-05      2          4  -0.50
12     B  2022-09-07      1          6  -0.50
13     B  2022-09-08      3          7  -0.50
14     B  2022-09-10      2          9  -0.50

Here are some example calculations to give you an idea:

  • For the first 3 days of group A: slope([0,1,3],[2,1,3])=0.43
  • For the 3 days later of group A: slope([5,6,6],[2,1,2])=-0.5
  • For again 3 days later of group A: slope([7,8],[4,2])=-2.0

So I was wondering if anyone knows how to determine the slope for every n days (this case 3 days) per group? Please note: Not all dates are included, so it is really every n days.

CodePudding user response:

so you need to:

  1. split each group into groups of n elements (or less at end) - use Numpy array_split
  2. calculate the slope - use Numpy polyfit
  3. append n times (or less at end)

here goes:

n = 3
slopes = []
for k, g in df.groupby('group'):
    a = np.array_split(g['diff_days'].values, n)
    b = np.array_split(g['value'].values, n)
    for ab in zip(a,b):
        for x in ab[0]:
            slopes.append(np.polyfit(ab[0], ab[1], 1)[0].round(2))

df['slopes'] = slopes

CodePudding user response:

A possible solution, using pandas groupby, transform and apply:

# size of the blocks
n = 3

# this is to form blocks of 3 elements for each group
df['blk'] = df.groupby('group')['date'].transform(
    lambda x: np.repeat(range(int(np.ceil(len(x)/n))), n)[range(len(x))])

# this function calculates the slopes for each block of 3 
def f(x):
    return x.assign(slope = np.polyfit(x['diff_days'], x['value'], 1)[0])

df.groupby(['group', 'blk'], group_keys=False).apply(f).drop('blk', axis=1)

Output:

   group       date  value  diff_days     slope
0      A 2022-09-01      2          0  0.428571
1      A 2022-09-02      1          1  0.428571
2      A 2022-09-04      3          3  0.428571
3      A 2022-09-06      2          5 -0.500000
4      A 2022-09-07      1          6 -0.500000
5      A 2022-09-07      2          6 -0.500000
6      A 2022-09-08      4          7 -2.000000
7      A 2022-09-09      2          8 -2.000000
8      B 2022-09-01      2          0  0.142857
9      B 2022-09-03      4          2  0.142857
10     B 2022-09-04      2          3  0.142857
11     B 2022-09-05      2          4  0.214286
12     B 2022-09-07      1          6  0.214286
13     B 2022-09-08      3          7  0.214286
14     B 2022-09-10      2          9  0.111111
  • Related