Given the following dataset:
group_id | from_date | to_date |
---|---|---|
0 | 2020-01-01 | 2020-02-01 |
0 | 2020-02-01 | 2020-03-01 |
0 | 2020-03-01 | 2020-04-01 |
1 | 2020-01-01 | 2020-02-01 |
1 | 2020-02-01 | 2020-03-01 |
I'm trying to compute, for each group, max(to_date) - min(from_date), so that the result looks like this (give or take a few days, depending on month lengths):
group_id | duration_days |
---|---|
0 | 90 |
1 | 60 |
Using the following correctly computes the duration, but returns an ungrouped dataframe of 5 rows:
(df.groupby(["group_id"])
.apply(lambda x: x.assign(duration_days=(np.max(x["to_date"])-np.min(x["from_date"])).days))
)
I've tried using aggregate, but I haven't managed to make it work with a function that uses two columns.
CodePudding user response:
Let's avoid apply with a lambda and use melt with np.ptp (peak to peak, i.e. max minus min) instead:
out = df.melt('group_id').groupby('group_id')['value'].agg(np.ptp).reset_index(name = 'duration_days')
Out[16]:
group_id duration_days
0 0 91 days
1 1 60 days
CodePudding user response:
Your approach was almost correct; just aggregate to a Series and take care of the renaming afterwards:
# ensure datetime
df['from_date'] = pd.to_datetime(df['from_date'])
df['to_date'] = pd.to_datetime(df['to_date'])
(df.groupby('group_id')
.apply(lambda g: g['to_date'].max()-g['from_date'].min())
.reset_index(name='duration_days')
)
output:
group_id duration_days
0 0 91 days
1 1 60 days
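If you want plain integers as in the question rather than Timedelta values, the same idea can be sketched as follows. This is only a minimal sketch: the sample frame is rebuilt from the table in the question, and selecting the two date columns before apply is my own addition to keep the grouping column out of the lambda.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed to match the real df)
df = pd.DataFrame({
    "group_id": [0, 0, 0, 1, 1],
    "from_date": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-01-01", "2020-02-01"]
    ),
    "to_date": pd.to_datetime(
        ["2020-02-01", "2020-03-01", "2020-04-01", "2020-02-01", "2020-03-01"]
    ),
})

out = (
    df.groupby("group_id")[["from_date", "to_date"]]
      # one Timedelta per group, converted to a whole number of days
      .apply(lambda g: (g["to_date"].max() - g["from_date"].min()).days)
      .reset_index(name="duration_days")
)
print(out)
```

With this data, duration_days holds integers instead of "91 days" / "60 days".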
CodePudding user response:
Compute the maximum and minimum per group (this assumes the columns are already datetime objects, which they probably are; if not, convert them with pd.to_datetime first):
maxi = df.groupby('group_id').to_date.max()
mini = df.groupby('group_id').from_date.min()
Then subtract them from each other:
(maxi - mini).reset_index(name='duration_days')
Output:
group_id duration_days
0 0 91 days
1 1 60 days
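Since the question mentions trying aggregate, here is a hedged sketch of the same two-step idea written with named aggregation: aggregate each column on its own, then subtract. The sample frame is rebuilt from the question, and the intermediate column names start and end are just illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed to match the real df)
df = pd.DataFrame({
    "group_id": [0, 0, 0, 1, 1],
    "from_date": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-01-01", "2020-02-01"]
    ),
    "to_date": pd.to_datetime(
        ["2020-02-01", "2020-03-01", "2020-04-01", "2020-02-01", "2020-03-01"]
    ),
})

# Named aggregation: one output column per (input column, function) pair
bounds = df.groupby("group_id").agg(
    start=("from_date", "min"),  # hypothetical column name
    end=("to_date", "max"),      # hypothetical column name
)

# Subtract the two aggregated columns and keep whole days as integers
out = (bounds["end"] - bounds["start"]).dt.days.reset_index(name="duration_days")
print(out)
```

This sidesteps the "function using two columns" problem, because agg only ever sees one column at a time and the subtraction happens afterwards.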