I have the following dataframe (sample):
import pandas as pd
data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
['A', '2022-09-07', 1], ['A', '2022-09-07', 3], ['A', '2022-09-08', 4], ['A', '2022-09-08', 2],
['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
['B', '2022-09-07', 1], ['B', '2022-09-08', 1], ['B', '2022-09-10', 1]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'value'])
    group        date  value
0       A  2022-09-01      2
1       A  2022-09-02      1
2       A  2022-09-04      3
3       A  2022-09-06      2
4       A  2022-09-07      1
5       A  2022-09-07      3
6       A  2022-09-08      4
7       A  2022-09-08      2
8       B  2022-09-01      2
9       B  2022-09-03      4
10      B  2022-09-04      2
11      B  2022-09-05      2
12      B  2022-09-07      1
13      B  2022-09-08      1
14      B  2022-09-10      1
I would like to create a column "diff_days" that shows, for each row, the difference in days between its date and the most centered date of its group. The centered date itself should have a value of 0 in the diff_days column. Here is the desired output:
data = [['A', '2022-09-01', 2, -3], ['A', '2022-09-02', 1, -2], ['A', '2022-09-04', 3, 0], ['A', '2022-09-06', 2, 2],
['A', '2022-09-07', 1, 3], ['A', '2022-09-07', 3, 3], ['A', '2022-09-08', 4, 4], ['A', '2022-09-08', 2, 4],
['B', '2022-09-01', 2, -4], ['B', '2022-09-03', 4, -2], ['B', '2022-09-04', 2, -1], ['B', '2022-09-05', 2, 0],
['B', '2022-09-07', 1, 2], ['B', '2022-09-08', 1, 3], ['B', '2022-09-10', 1, 5]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'value', 'diff_days'])
    group        date  value  diff_days
0       A  2022-09-01      2         -3
1       A  2022-09-02      1         -2
2       A  2022-09-04      3          0
3       A  2022-09-06      2          2
4       A  2022-09-07      1          3
5       A  2022-09-07      3          3
6       A  2022-09-08      4          4
7       A  2022-09-08      2          4
8       B  2022-09-01      2         -4
9       B  2022-09-03      4         -2
10      B  2022-09-04      2         -1
11      B  2022-09-05      2          0
12      B  2022-09-07      1          2
13      B  2022-09-08      1          3
14      B  2022-09-10      1          5
For group A, the most centered date is 2022-09-04, and for group B it is 2022-09-05; both have a diff_days value of 0. The centered date can be calculated as min + (max - min)/2 on the dates. Does anyone know how to do this calculation using Pandas?
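For example, working through group A by hand gives the expected center. This is only a small sketch of the arithmetic; rounding the midpoint down to a whole day is an assumption on my part, chosen because it matches the desired output above.

```python
import pandas as pd

# Earliest and latest dates of group A in the sample data.
min_date = pd.Timestamp("2022-09-01")
max_date = pd.Timestamp("2022-09-08")

# Midpoint of the range: 2022-09-01 plus 3.5 days.
midpoint = min_date + (max_date - min_date) / 2

# Assumption: round down to the start of the day to get a whole date.
center = midpoint.floor("D")
print(center)  # 2022-09-04 00:00:00
```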
CodePudding user response:
You can define a function that implements your logic and apply it per group with pandas' groupby(...).apply:
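def get_center(dates):
    # Midpoint of the group's date range, rounded down to a whole day.
    min_, max_ = dates.min(), dates.max()
    center = (min_ + (max_ - min_) / 2).floor("D")
    # Signed distance of each date from the center, in days.
    return (dates - center).dt.days
df["date"] = pd.to_datetime(df["date"])
# group_keys=False keeps the original row index, so the result aligns with df.
df["diff_days"] = df.groupby("group", group_keys=False)["date"].apply(get_center)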
Output:
    group        date  value  diff_days
0       A  2022-09-01      2         -3
1       A  2022-09-02      1         -2
2       A  2022-09-04      3          0
3       A  2022-09-06      2          2
4       A  2022-09-07      1          3
5       A  2022-09-07      3          3
6       A  2022-09-08      4          4
7       A  2022-09-08      2          4
8       B  2022-09-01      2         -4
9       B  2022-09-03      4         -2
10      B  2022-09-04      2         -1
11      B  2022-09-05      2          0
12      B  2022-09-07      1          2
13      B  2022-09-08      1          3
14      B  2022-09-10      1          5
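As an alternative to apply, the same calculation can probably be written without a custom function by broadcasting each group's min and max back to its rows with transform. This is a sketch under the same assumptions as above (midpoint rounded down to a whole day); the intermediate names dates, min_, max_ and center are only illustrative.

```python
import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 3], ['A', '2022-09-08', 4], ['A', '2022-09-08', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 1], ['B', '2022-09-10', 1]]
df = pd.DataFrame(data=data, columns=['group', 'date', 'value'])
df["date"] = pd.to_datetime(df["date"])

# Per-group min and max, repeated on every row of the group.
dates = df.groupby("group")["date"]
min_ = dates.transform("min")
max_ = dates.transform("max")

# Row-wise center date for each group, rounded down to a whole day.
center = (min_ + (max_ - min_) / 2).dt.floor("D")

# Signed distance of each row's date from its group's center, in days.
df["diff_days"] = (df["date"] - center).dt.days
print(df)
```

Because transform returns a result indexed like the original frame, no index alignment step is needed before assigning the new column.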