I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'value'])
df['date'] = pd.to_datetime(df['date'])
df['diff_days'] = (df['date']-df['date'].groupby(df['group']).transform('first')).dt.days
group date value diff_days
0 A 2022-09-01 2 0
1 A 2022-09-02 1 1
2 A 2022-09-04 3 3
3 A 2022-09-06 2 5
4 A 2022-09-07 1 6
5 A 2022-09-07 2 6
6 A 2022-09-08 4 7
7 A 2022-09-09 2 8
8 B 2022-09-01 2 0
9 B 2022-09-03 4 2
10 B 2022-09-04 2 3
11 B 2022-09-05 2 4
12 B 2022-09-07 1 6
13 B 2022-09-08 3 7
14 B 2022-09-10 2 9
I would like to create a column called "slope" which shows the slope for every n (n = 3) days per group. This means that when the first date is "2022-09-01" and 3 days later are used for the calculation. The slope can be calculated using the "diff_days" (calculated by difference with the first value per group) and "value" columns. Here is the desired output:
data = [['A', '2022-09-01', 2, 0, 0.43], ['A', '2022-09-02', 1, 1, 0.43], ['A', '2022-09-04', 3, 3, 0.43], ['A', '2022-09-06', 2, 5, -0.5],
['A', '2022-09-07', 1, 6, -0.5], ['A', '2022-09-07', 2, 6, -0.5], ['A', '2022-09-08', 4, 7, -2], ['A', '2022-09-09', 2, 8, -2],
['B', '2022-09-01', 2, 0, 0.14], ['B', '2022-09-03', 4, 2, 0.14], ['B', '2022-09-04', 2, 3, 0.14], ['B', '2022-09-05', 2, 4, -0.5],
['B', '2022-09-07', 1, 6, -0.5], ['B', '2022-09-08', 3, 7, -0.5], ['B', '2022-09-10', 2, 9, -0.5]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'value', 'diff_days', 'slope'])
group date value diff_days slope
0 A 2022-09-01 2 0 0.43
1 A 2022-09-02 1 1 0.43
2 A 2022-09-04 3 3 0.43
3 A 2022-09-06 2 5 -0.50
4 A 2022-09-07 1 6 -0.50
5 A 2022-09-07 2 6 -0.50
6 A 2022-09-08 4 7 -2.00
7 A 2022-09-09 2 8 -2.00
8 B 2022-09-01 2 0 0.14
9 B 2022-09-03 4 2 0.14
10 B 2022-09-04 2 3 0.14
11 B 2022-09-05 2 4 -0.50
12 B 2022-09-07 1 6 -0.50
13 B 2022-09-08 3 7 -0.50
14 B 2022-09-10 2 9 -0.50
Here are some example calculations to give you an idea:
- For the first 3 days of group A: slope([0,1,3],[2,1,3])=0.43
- For the 3 days later of group A: slope([5,6,6],[2,1,2])=-0.5
- For again 3 days later of group A: slope([7,8],[4,2])=-2.0
So I was wondering if anyone knows how to determine the slope for every n days (this case 3 days) per group? Please note: Not all dates are included, so it is really every n days.
CodePudding user response:
so you need to:
- split each group into groups of n elements (or less at end) - use Numpy array_split
- calculate the slope - use Numpy polyfit
- append n times (or less at end)
here goes:
n = 3
slopes = []
for k, g in df.groupby('group'):
a = np.array_split(g['diff_days'].values, n)
b = np.array_split(g['value'].values, n)
for ab in zip(a,b):
for x in ab[0]:
slopes.append(np.polyfit(ab[0], ab[1], 1)[0].round(2))
df['slopes'] = slopes
CodePudding user response:
A possible solution, using pandas groupby
, transform
and apply
:
# size of the blocks
n = 3
# this is to form blocks of 3 elements for each group
df['blk'] = df.groupby('group')['date'].transform(
lambda x: np.repeat(range(int(np.ceil(len(x)/n))), n)[range(len(x))])
# this function calculates the slopes for each block of 3
def f(x):
return x.assign(slope = np.polyfit(x['diff_days'], x['value'], 1)[0])
df.groupby(['group', 'blk'], group_keys=False).apply(f).drop('blk', axis=1)
Output:
group date value diff_days slope
0 A 2022-09-01 2 0 0.428571
1 A 2022-09-02 1 1 0.428571
2 A 2022-09-04 3 3 0.428571
3 A 2022-09-06 2 5 -0.500000
4 A 2022-09-07 1 6 -0.500000
5 A 2022-09-07 2 6 -0.500000
6 A 2022-09-08 4 7 -2.000000
7 A 2022-09-09 2 8 -2.000000
8 B 2022-09-01 2 0 0.142857
9 B 2022-09-03 4 2 0.142857
10 B 2022-09-04 2 3 0.142857
11 B 2022-09-05 2 4 0.214286
12 B 2022-09-07 1 6 0.214286
13 B 2022-09-08 3 7 0.214286
14 B 2022-09-10 2 9 0.111111