I have a dataframe like this:
import pandas as pd
data = [['A', '2022-09-01', False, 2], ['A', '2022-09-02', False, 4], ['A', '2022-09-03', True, 5],
['A', '2022-09-05', False, 1], ['A', '2022-09-06', False, 2], ['A', '2022-09-07', False, 5],
['A', '2022-09-17', False, 3], ['A', '2022-09-19', True, 2], ['A', '2022-09-20', False, 2],
['B', '2022-09-04', False, 4], ['B', '2022-09-06', True, 1], ['B', '2022-09-16', False, 6],
['B', '2022-09-17', False, 2], ['B', '2022-09-18', True, 4], ['B', '2022-09-19', False, 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])
group date indicator value
0 A 2022-09-01 False 2
1 A 2022-09-02 False 4
2 A 2022-09-03 True 5
3 A 2022-09-05 False 1
4 A 2022-09-06 False 2
5 A 2022-09-07 False 5
6 A 2022-09-17 False 3
7 A 2022-09-19 True 2
8 A 2022-09-20 False 2
9 B 2022-09-04 False 4
10 B 2022-09-06 True 1
11 B 2022-09-16 False 6
12 B 2022-09-17 False 2
13 B 2022-09-18 True 4
14 B 2022-09-19 False 2
I would like to add a column called diff_days
, which shows the difference in days with rows that have indicator = True
and the rest per group. Note that in some groups there could be two rows with indicator = True
, so I would like to have the difference in days with the closest rows. Here you can see the desired output:
data = [['A', '2022-09-01', False, 2, -2], ['A', '2022-09-02', False, 4, -1], ['A', '2022-09-03', True, 5, 0],
['A', '2022-09-05', False, 1, 2], ['A', '2022-09-06', False, 2, 3], ['A', '2022-09-07', False, 5, 4],
['A', '2022-09-17', False, 3, -2], ['A', '2022-09-19', True, 2, 0], ['A', '2022-09-20', False, 2, 1],
['B', '2022-09-04', False, 4, -2], ['B', '2022-09-06', True, 1, 0], ['B', '2022-09-16', False, 6, -2],
['B', '2022-09-17', False, 2, -1], ['B', '2022-09-18', True, 4, 0], ['B', '2022-09-19', False, 2, 1]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days'])
group date indicator value diff_days
0 A 2022-09-01 False 2 -2
1 A 2022-09-02 False 4 -1
2 A 2022-09-03 True 5 0
3 A 2022-09-05 False 1 2
4 A 2022-09-06 False 2 3
5 A 2022-09-07 False 5 4
6 A 2022-09-17 False 3 -2
7 A 2022-09-19 True 2 0
8 A 2022-09-20 False 2 1
9 B 2022-09-04 False 4 -2
10 B 2022-09-06 True 1 0
11 B 2022-09-16 False 6 -2
12 B 2022-09-17 False 2 -1
13 B 2022-09-18 True 4 0
14 B 2022-09-19 False 2 1
As you can see it depends on how close the date
in days is with the date
where indicator = True
. So I was wondering if anyone knows how to calculate the difference in days with the closest days with a condition per group?
CodePudding user response:
You can go with merge_asof. perform it on df with only the rows with True in indicator column. Then you can calculate the difference in days within the same group with the nearest True date of the group.
# need datetime column
df['date'] = pd.to_datetime(df['date'])
res = (
pd.merge_asof(df.sort_values('date'),
df.loc[df['indicator'], ['group','date']].sort_values('date')
.assign(diff_days=lambda x: x['date']),
by='group', on='date', direction='nearest')
.assign(diff_days=lambda x: (x['date']-x['diff_days']).dt.days)
.sort_values(['group','date'])
.reset_index(drop=True)
)
print(res)
# group date indicator value diff_days
# 0 A 2022-09-01 False 2 -2
# 1 A 2022-09-02 False 4 -1
# 2 A 2022-09-03 True 5 0
# 3 A 2022-09-05 False 1 2
# 4 A 2022-09-06 False 2 3
# 5 A 2022-09-07 False 5 4
# 6 A 2022-09-17 False 3 -2
# 7 A 2022-09-19 True 2 0
# 8 A 2022-09-20 False 2 1
# 9 B 2022-09-04 False 4 -2
# 10 B 2022-09-06 True 1 0
# 11 B 2022-09-16 False 6 -2
# 12 B 2022-09-17 False 2 -1
# 13 B 2022-09-18 True 4 0
# 14 B 2022-09-19 False 2 1