Calculate difference in days with its most centered date per group using Pandas

Time:10-05

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 3], ['A', '2022-09-08', 4], ['A', '2022-09-08', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 1], ['B', '2022-09-10', 1]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'value'])

   group        date  value
0      A  2022-09-01      2
1      A  2022-09-02      1
2      A  2022-09-04      3
3      A  2022-09-06      2
4      A  2022-09-07      1
5      A  2022-09-07      3
6      A  2022-09-08      4
7      A  2022-09-08      2
8      B  2022-09-01      2
9      B  2022-09-03      4
10     B  2022-09-04      2
11     B  2022-09-05      2
12     B  2022-09-07      1
13     B  2022-09-08      1
14     B  2022-09-10      1

I would like to create a column "diff_days" that shows, for each row, the difference in days between its date and the most centered date of its group. The centered date should have a value of 0 in the diff_days column. Here is the desired output:

data = [['A', '2022-09-01', 2, -3], ['A', '2022-09-02', 1, -2], ['A', '2022-09-04', 3, 0], ['A', '2022-09-06', 2, 2],
        ['A', '2022-09-07', 1, 3], ['A', '2022-09-07', 3, 3], ['A', '2022-09-08', 4, 4], ['A', '2022-09-08', 2, 4],
        ['B', '2022-09-01', 2, -4], ['B', '2022-09-03', 4, -2], ['B', '2022-09-04', 2, -1], ['B', '2022-09-05', 2, 0],
        ['B', '2022-09-07', 1, 2], ['B', '2022-09-08', 1, 3], ['B', '2022-09-10', 1, 5]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'value', 'diff_days'])

   group        date  value  diff_days
0      A  2022-09-01      2         -3
1      A  2022-09-02      1         -2
2      A  2022-09-04      3          0
3      A  2022-09-06      2          2
4      A  2022-09-07      1          3
5      A  2022-09-07      3          3
6      A  2022-09-08      4          4
7      A  2022-09-08      2          4
8      B  2022-09-01      2         -4
9      B  2022-09-03      4         -2
10     B  2022-09-04      2         -1
11     B  2022-09-05      2          0
12     B  2022-09-07      1          2
13     B  2022-09-08      1          3
14     B  2022-09-10      1          5

For group A the most centered date is 2022-09-04, and for group B it is 2022-09-05; both have a diff_days value of 0. The center can be calculated from the dates as min + (max - min)/2. Does anyone know how to do this calculation using Pandas?
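As a quick sanity check of that formula, here is a minimal sketch for group A only. Rounding the midpoint down to a whole day is an assumption that happens to match the desired output above; the names lo, hi, midpoint and center are purely illustrative.

```python
import pandas as pd

# Group A spans 2022-09-01 .. 2022-09-08 in the sample data.
lo, hi = pd.Timestamp("2022-09-01"), pd.Timestamp("2022-09-08")

# Midpoint of the range: 7 days / 2 = 3.5 days, so it lands at noon.
midpoint = lo + (hi - lo) / 2

# Assumption: round down to a whole day so the center is a calendar date.
center = midpoint.floor("D")

print(midpoint)  # 2022-09-04 12:00:00
print(center)    # 2022-09-04 00:00:00
```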

CodePudding user response:

You can just define a new function as per your logic and use pandas.DataFrame.groupby.apply:

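def get_center(dates):
    # Midpoint of the group's date range, rounded down to a whole day
    min_, max_ = dates.min(), dates.max()
    center = (min_ + (max_ - min_)/2).floor("D")
    # Signed day offset of each date from that center
    return (dates - center).dt.days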

df["date"] = pd.to_datetime(df["date"])
# group_keys=False keeps the original row index, so the result aligns with df
df["diff_days"] = df.groupby("group", group_keys=False)["date"].apply(get_center)

Output:

   group       date  value  diff_days
0      A 2022-09-01      2         -3
1      A 2022-09-02      1         -2
2      A 2022-09-04      3          0
3      A 2022-09-06      2          2
4      A 2022-09-07      1          3
5      A 2022-09-07      3          3
6      A 2022-09-08      4          4
7      A 2022-09-08      2          4
8      B 2022-09-01      2         -4
9      B 2022-09-03      4         -2
10     B 2022-09-04      2         -1
11     B 2022-09-05      2          0
12     B 2022-09-07      1          2
13     B 2022-09-08      1          3
14     B 2022-09-10      1          5
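
If a custom function per group is not needed, the same midpoint logic can probably be written without apply by broadcasting each group's min and max back to its rows with transform. This is only a sketch of an alternative, not part of the answer above, and it uses a small subset of the sample rows to stay short:

```python
import pandas as pd

# A subset of the sample rows (first, centered, and last date of each group).
df = pd.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B"],
        "date": ["2022-09-01", "2022-09-04", "2022-09-08",
                 "2022-09-01", "2022-09-05", "2022-09-10"],
    }
)
df["date"] = pd.to_datetime(df["date"])

# Per-group min and max, broadcast back to every row of the group.
grouped = df.groupby("group")["date"]
min_ = grouped.transform("min")
max_ = grouped.transform("max")

# Midpoint of each group's date range, floored to a whole day.
center = (min_ + (max_ - min_) / 2).dt.floor("D")

# Signed day offset of each row's date from its group's center.
df["diff_days"] = (df["date"] - center).dt.days
print(df["diff_days"].tolist())  # [-3, 0, 4, -4, 0, 5]
```

Because transform returns a Series with the same index as df, the assignment aligns row by row without any index handling.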