I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01', False, 2], ['A', '2022-09-02', False, 1], ['A', '2022-09-03', False, 1], ['A', '2022-09-04', True, 3],
['A', '2022-09-05', False, 3], ['A', '2022-09-06', False, 2], ['A', '2022-09-07', False, 1], ['A', '2022-09-07', False, 2],
['A', '2022-09-08', False, 4], ['A', '2022-09-09', False, 2],
['B', '2022-09-01', False, 2], ['B', '2022-09-02', False, 2], ['B', '2022-09-03', False, 4], ['B', '2022-09-04', False, 2],
['B', '2022-09-05', True, 2], ['B', '2022-09-06', False, 2], ['B', '2022-09-07', False, 1], ['B', '2022-09-08', False, 3],
['B', '2022-09-09', False, 3], ['B', '2022-09-10', False, 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
# Add diff_days which is difference in days with closest row with True condition per group
df['date'] = pd.to_datetime(df['date'])
df = (
pd.merge_asof(df.sort_values('date'),
df.loc[df['indicator'], ['group','date']].sort_values('date')
.assign(diff_days=lambda x: x['date']),
by='group', on='date', direction='nearest')
.assign(diff_days=lambda x: (x['date']-x['diff_days']).dt.days)
.sort_values(['group','date'])
.reset_index(drop=True)
)
group date indicator value diff_days
0 A 2022-09-01 False 2 -3
1 A 2022-09-02 False 1 -2
2 A 2022-09-03 False 1 -1
3 A 2022-09-04 True 3 0
4 A 2022-09-05 False 3 1
5 A 2022-09-06 False 2 2
6 A 2022-09-07 False 2 3
7 A 2022-09-07 False 1 3
8 A 2022-09-08 False 4 4
9 A 2022-09-09 False 2 5
10 B 2022-09-01 False 2 -4
11 B 2022-09-02 False 2 -3
12 B 2022-09-03 False 4 -2
13 B 2022-09-04 False 2 -1
14 B 2022-09-05 True 2 0
15 B 2022-09-06 False 2 1
16 B 2022-09-07 False 1 2
17 B 2022-09-08 False 3 3
18 B 2022-09-09 False 3 4
19 B 2022-09-10 False 2 5
I would like to add a column called "slope" which returns the slope for n days (this case n = 3) with respect to the rows with condition "indicator = True" per group. Here is the desired output:
data = [['A', '2022-09-01', False, 2, -3, -0.5], ['A', '2022-09-02', False, 1, -2, -0.5], ['A', '2022-09-03', False, 1, -1, -0.5], ['A', '2022-09-04', True, 3, 0, 0],
['A', '2022-09-05', False, 3, 1, -0.5], ['A', '2022-09-06', False, 2, 2, -0.5], ['A', '2022-09-07', False, 2, 3, -0.5], ['A', '2022-09-07', False, 1, 3, 0.5],
['A', '2022-09-08', False, 4, 4, 0.5], ['A', '2022-09-09', False, 2, 5, 0.5],
['B', '2022-09-01', False, 2, -4], ['B', '2022-09-02', False, 2, -3, 0], ['B', '2022-09-03', False, 4, -2, 0], ['B', '2022-09-04', False, 2, -1, 0],
['B', '2022-09-05', True, 2, 0, 0], ['B', '2022-09-06', False, 2, 1, 0.5], ['B', '2022-09-07', False, 1, 2, 0.5], ['B', '2022-09-08', False, 3, 3, 0.5],
['B', '2022-09-09', False, 3, 4, -1], ['B', '2022-09-10', False, 2, 5, -1]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days', 'slope'])
group date indicator value diff_days slope
0 A 2022-09-01 False 2 -3 -0.5
1 A 2022-09-02 False 1 -2 -0.5
2 A 2022-09-03 False 1 -1 -0.5
3 A 2022-09-04 True 3 0 0.0
4 A 2022-09-05 False 3 1 -0.5
5 A 2022-09-06 False 2 2 -0.5
6 A 2022-09-07 False 2 3 -0.5
7 A 2022-09-07 False 1 3 0.5
8 A 2022-09-08 False 4 4 0.5
9 A 2022-09-09 False 2 5 0.5
10 B 2022-09-01 False 2 -4 NaN
11 B 2022-09-02 False 2 -3 0.0
12 B 2022-09-03 False 4 -2 0.0
13 B 2022-09-04 False 2 -1 0.0
14 B 2022-09-05 True 2 0 0.0
15 B 2022-09-06 False 2 1 0.5
16 B 2022-09-07 False 1 2 0.5
17 B 2022-09-08 False 3 3 0.5
18 B 2022-09-09 False 3 4 -1.0
19 B 2022-09-10 False 2 5 -1.0
Let's explain the calculations of group B. The slope (using "diff_days" as x values) are calculated for n = 3 with respect to the row with indicator == True, which is row 15 in dataframe:
- For rows 12,13,14 the slope is: linregress([-3,-2,-1],[2,4,2])=0
- Row 11 is alone because it doesn't fit in the 3 days range with respect to the certain row (indicator==True) which means: linregress([-4],[2])=NaN
- For rows 16,17,18 the slope is: linregress([1,2,3],[2,1,3])=0.5
- For rows 19,20 the slope is: linregress([4,5],[3,2])=-1.0
Please note: the rows with condition (indicator==True) should have a slope value of 0.
So, I was wondering if anyone knows how to calculate the slopes of n days with respect to a certain row per group using pandas
?
CodePudding user response:
I'm a bit confused by your example - is there a mistake in 10 (what you call Row 11?) If I understand correctly, what you want is to count each group as new after either a) indicator
becomes True or b) every 3 rows. This can be done like this:
from scipy.stats import linregress
def count_every_n(grp, n):
return pd.Series([k // n for k in range(len(grp))])
def get_slope(grp):
return pd.Series(linregress(grp.diff_days, grp.value).slope, index=grp.index)
indicator_change = (df.indicator != df.indicator.shift()).cumsum()
every_n_within_groups = (df
.groupby(indicator_change, group_keys=False)
.apply(lambda g: count_every_n(g, n=3))
.reset_index(drop=True))
df['slope'] = (df
.groupby([indicator_change, every_n_within_groups])
.apply(get_slope)
.reset_index(drop=True)
.fillna(0))