Home > Software engineering >  Calculate Difference in Days with Closest rows with certain Condition using Pandas
Calculate Difference in Days with Closest rows with certain Condition using Pandas

Time:09-28

I have a dataframe like this:

import pandas as pd

data = [['A', '2022-09-01', False, 2], ['A', '2022-09-02', False, 4], ['A', '2022-09-03', True, 5],
        ['A', '2022-09-05', False, 1], ['A', '2022-09-06', False, 2], ['A', '2022-09-07', False, 5],
        ['A', '2022-09-17', False, 3], ['A', '2022-09-19', True, 2], ['A', '2022-09-20', False, 2],
        ['B', '2022-09-04', False, 4], ['B', '2022-09-06', True, 1], ['B', '2022-09-16', False, 6],
        ['B', '2022-09-17', False, 2], ['B', '2022-09-18', True, 4], ['B', '2022-09-19', False, 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

   group        date  indicator  value
0      A  2022-09-01      False      2
1      A  2022-09-02      False      4
2      A  2022-09-03       True      5
3      A  2022-09-05      False      1
4      A  2022-09-06      False      2
5      A  2022-09-07      False      5
6      A  2022-09-17      False      3
7      A  2022-09-19       True      2
8      A  2022-09-20      False      2
9      B  2022-09-04      False      4
10     B  2022-09-06       True      1
11     B  2022-09-16      False      6
12     B  2022-09-17      False      2
13     B  2022-09-18       True      4
14     B  2022-09-19      False      2

I would like to add a column called diff_days, which shows the difference in days with rows that have indicator = True and the rest per group. Note that in some groups there could be two rows with indicator = True, so I would like to have the difference in days with the closest rows. Here you can see the desired output:

data = [['A', '2022-09-01', False, 2, -2], ['A', '2022-09-02', False, 4, -1], ['A', '2022-09-03', True, 5, 0],
        ['A', '2022-09-05', False, 1, 2], ['A', '2022-09-06', False, 2, 3], ['A', '2022-09-07', False, 5, 4],
        ['A', '2022-09-17', False, 3, -2], ['A', '2022-09-19', True, 2, 0], ['A', '2022-09-20', False, 2, 1],
        ['B', '2022-09-04', False, 4, -2], ['B', '2022-09-06', True, 1, 0], ['B', '2022-09-16', False, 6, -2],
        ['B', '2022-09-17', False, 2, -1], ['B', '2022-09-18', True, 4, 0], ['B', '2022-09-19', False, 2, 1]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value', 'diff_days'])

   group        date  indicator  value  diff_days
0      A  2022-09-01      False      2         -2
1      A  2022-09-02      False      4         -1
2      A  2022-09-03       True      5          0
3      A  2022-09-05      False      1          2
4      A  2022-09-06      False      2          3
5      A  2022-09-07      False      5          4
6      A  2022-09-17      False      3         -2
7      A  2022-09-19       True      2          0
8      A  2022-09-20      False      2          1
9      B  2022-09-04      False      4         -2
10     B  2022-09-06       True      1          0
11     B  2022-09-16      False      6         -2
12     B  2022-09-17      False      2         -1
13     B  2022-09-18       True      4          0
14     B  2022-09-19      False      2          1

As you can see it depends on how close the date in days is with the date where indicator = True. So I was wondering if anyone knows how to calculate the difference in days with the closest days with a condition per group?

CodePudding user response:

You can go with merge_asof. perform it on df with only the rows with True in indicator column. Then you can calculate the difference in days within the same group with the nearest True date of the group.

# need datetime column
df['date'] = pd.to_datetime(df['date'])

res = (
    pd.merge_asof(df.sort_values('date'), 
                  df.loc[df['indicator'], ['group','date']].sort_values('date')
                    .assign(diff_days=lambda x: x['date']), 
                  by='group', on='date', direction='nearest')
      .assign(diff_days=lambda x: (x['date']-x['diff_days']).dt.days)
      .sort_values(['group','date'])
      .reset_index(drop=True)
)
print(res)
#    group       date  indicator  value  diff_days
# 0      A 2022-09-01      False      2         -2
# 1      A 2022-09-02      False      4         -1
# 2      A 2022-09-03       True      5          0
# 3      A 2022-09-05      False      1          2
# 4      A 2022-09-06      False      2          3
# 5      A 2022-09-07      False      5          4
# 6      A 2022-09-17      False      3         -2
# 7      A 2022-09-19       True      2          0
# 8      A 2022-09-20      False      2          1
# 9      B 2022-09-04      False      4         -2
# 10     B 2022-09-06       True      1          0
# 11     B 2022-09-16      False      6         -2
# 12     B 2022-09-17      False      2         -1
# 13     B 2022-09-18       True      4          0
# 14     B 2022-09-19      False      2          1
  • Related