I have a `pandas.DataFrame` that looks like this:
| id | start | end |
|:--:|:-----------------:|:----------------:|
| a | 1/1/20 12:00 AM | 1/2/20 12:00 AM |
| b | 1/1/20 6:37 PM | 1/2/20 7:11 PM |
| b | 1/4/20 1:17 AM | |
| c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
| d | 4/19/20 8:45 PM | 4/23/20 12:13 AM |
| d | 11/21/20 12:00 AM | 3/2/21 12:00 AM |
I am trying to determine, for each `id`, the `min()` of `start` and the `max()` of `end`. My issue is that in some cases `end` can be blank, and in that case it should count as the max (i.e., the issue is not closed).
Ideally, the result would look like this:
| id | start | end |
|:--:|:---------------:|:----------------:|
| a | 1/1/20 12:00 AM | 1/2/20 12:00 AM |
| b | 1/1/20 6:37 PM | |
| c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
| d | 4/19/20 8:45 PM | 3/2/21 12:00 AM |
I have looked at this question for inspiration but have not found a solution.
MRE below:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({
    'id':    {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd'},
    'start': {0: '1/1/20 12:00 AM', 1: '1/1/20 6:37 PM', 2: '1/4/20 1:17 AM',
              3: '2/4/20 12:00 AM', 4: '4/19/20 8:45 PM', 5: '11/21/20 12:00 AM'},
    'end':   {0: '1/2/20 12:00 AM', 1: '1/2/20 7:11 PM', 2: np.nan,
              3: '7/13/20 12:00 AM', 4: '4/23/20 12:13 AM', 5: '3/2/21 12:00 AM'},
})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

starts = df.groupby('id')['start'].min().reset_index()
ends = df.groupby('id')['end'].max().reset_index()
_df = pd.merge(starts, ends, on='id')
```
However, that produces:
| id | start | end |
|:--:|:---------------:|:--------------:|
| a | 1/1/2020 0:00 | 1/2/2020 0:00 |
| b | 1/1/2020 18:37 | 1/2/2020 19:11 |
| c | 2/4/2020 0:00 | 7/13/2020 0:00 |
| d | 4/19/2020 20:45 | 3/2/2021 0:00 |
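The `b` row shows what goes wrong: `max()` skips `NaT` by default (`skipna=True`), so the open end date is simply ignored. A minimal illustration:

```python
import pandas as pd

# max() skips NaT by default, so the "still open" end is ignored
s = pd.Series([pd.Timestamp('2020-01-02 19:11'), pd.NaT])
print(s.max())  # 2020-01-02 19:11:00, not NaT
```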
How can I accomplish my desired result?
CodePudding user response:
Sort by dates and use `iloc` to get the last value:
df.sort_values(["start", "end"]).groupby("id").agg({"start": "first",
"end": lambda x: x.iloc[-1]})
# start end
# id
# a 2020-01-01 00:00:00 2020-01-02
# b 2020-01-01 18:37:00 NaT
# c 2020-02-04 00:00:00 2020-07-13
# d 2020-04-19 20:45:00 2021-03-02
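Note this relies on the row with the open (`NaT`) end also having the latest `start` in its group, so that it sorts last. A sketch of a variant that does not depend on row order, using a far-future sentinel (my own addition, not part of the answer above; it reuses `df` from the MRE):

```python
import pandas as pd

# Replace NaT with a far-future sentinel so max() naturally prefers open rows
tmp = df.assign(end=df['end'].fillna(pd.Timestamp.max))
out = tmp.groupby('id').agg(start=('start', 'min'), end=('end', 'max')).reset_index()
out['end'] = out['end'].mask(out['end'] == pd.Timestamp.max)  # restore NaT for open issues
```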
CodePudding user response:
IIUC, use `DataFrame.mask` to set `NaT` wherever a group has any missing value in that column:
```python
new_df = (
    df.groupby('id')
      .agg({'start': 'min', 'end': 'max'})
      .mask(df[['start', 'end']].isna()
              .groupby(df['id'])
              .max())
      .reset_index()
)
print(new_df)
```

```
  id               start        end
0  a 2020-01-01 00:00:00 2020-01-02
1  b 2020-01-01 18:37:00        NaT
2  c 2020-02-04 00:00:00 2020-07-13
3  d 2020-04-19 20:45:00 2021-03-02
```
Detail:

```python
print(df[['start', 'end']].isna()
        .groupby(df['id'])
        .max())
```

```
    start    end
id
a   False  False
b   False   True
c   False  False
d   False  False
```
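Since `isna()` yields booleans, `.max()` here is equivalent to `.any()`; a slightly more explicit spelling (my rewording, not the answer's):

```python
# True where the group has at least one missing value in that column
has_missing = df[['start', 'end']].isna().groupby(df['id']).any()
```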
If there are more grouping columns:

```python
new_df = (
    df.groupby(['id', 'status'])
      .agg({'start': 'min', 'end': 'max'})
      .mask(df[['start', 'end']].isna()
              .groupby([df['id'], df['status']])
              .max())
      .reset_index()
)
```
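(Here `status` stands in for a hypothetical second grouping column; it is not part of the MRE above.)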