Take min and max with null values - pandas groupby

I have a pandas.DataFrame, that looks like this:

| id |       start       |        end       |
|:--:|:-----------------:|:----------------:|
|  a |  1/1/20 12:00 AM  |  1/2/20 12:00 AM |
|  b |   1/1/20 6:37 PM  |  1/2/20 7:11 PM  |
|  b |   1/4/20 1:17 AM  |                  |
|  c |  2/4/20 12:00 AM  | 7/13/20 12:00 AM |
|  d |  4/19/20 8:45 PM  | 4/23/20 12:13 AM |
|  d | 11/21/20 12:00 AM |  3/2/21 12:00 AM |

For each id, I am trying to determine the min() start and the max() end. My issue is that end can be blank, and in that case it should count as the max (i.e., the issue is not yet closed).

Ideally, the result would look like this:

| id |      start      |        end       |
|:--:|:---------------:|:----------------:|
|  a | 1/1/20 12:00 AM |  1/2/20 12:00 AM |
|  b |  1/1/20 6:37 PM |                  |
|  c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
|  d | 4/19/20 8:45 PM |  3/2/21 12:00 AM |

I have looked at this question for inspiration but have not found a solution.

MRE below:

import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({
    'id':    {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd'},
    'start': {0: '1/1/20 12:00 AM', 1: '1/1/20 6:37 PM', 2: '1/4/20 1:17 AM',
              3: '2/4/20 12:00 AM', 4: '4/19/20 8:45 PM', 5: '11/21/20 12:00 AM'},
    'end':   {0: '1/2/20 12:00 AM', 1: '1/2/20 7:11 PM', 2: np.nan,
              3: '7/13/20 12:00 AM', 4: '4/23/20 12:13 AM', 5: '3/2/21 12:00 AM'},
})

df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

starts = df.groupby('id')['start'].min().reset_index()
ends = df.groupby('id')['end'].max().reset_index()

_df = pd.merge(starts, ends, on='id')

However, because max() skips NaT by default, that produces:

| id |      start      |       end      |
|:--:|:---------------:|:--------------:|
|  a |  1/1/2020 0:00  |  1/2/2020 0:00 |
|  b |  1/1/2020 18:37 | 1/2/2020 19:11 |
|  c |  2/4/2020 0:00  | 7/13/2020 0:00 |
|  d | 4/19/2020 20:45 |  3/2/2021 0:00 |

How can I accomplish my desired result?

CodePudding user response:

Sort by the dates and use iloc to get the last value in each group. Unlike the "last" aggregation, iloc[-1] does not skip NaT:

df.sort_values(["start", "end"]).groupby("id").agg({"start": "first",
                                                    "end": lambda x: x.iloc[-1]})
#                  start        end
# id                               
# a  2020-01-01 00:00:00 2020-01-02
# b  2020-01-01 18:37:00        NaT
# c  2020-02-04 00:00:00 2020-07-13
# d  2020-04-19 20:45:00 2021-03-02
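
Note that this relies on the row with the missing end sorting last within its group. A sketch of a variant that does not depend on the start order, sorting by end alone (NaT sorts last by default with na_position="last"):

# Sketch: sort "end" ascending; NaT rows land at the bottom of each group,
# so iloc[-1] returns NaT whenever a group has an open-ended row,
# and the max end otherwise.
res = (df.sort_values("end")
         .groupby("id")
         .agg({"start": "min", "end": lambda s: s.iloc[-1]})
         .reset_index())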

CodePudding user response:

IIUC, use DataFrame.mask to set NaN wherever a group has any NaN in that column:

new_df = (
    df.groupby('id')
      .agg({'start': 'min', 'end': 'max'})
      .mask(df[['start', 'end']].isna()
                                .groupby(df['id'])
                                .max())
      .reset_index()
)

print(new_df)
  id               start        end
0  a 2020-01-01 00:00:00 2020-01-02
1  b 2020-01-01 18:37:00        NaT
2  c 2020-02-04 00:00:00 2020-07-13
3  d 2020-04-19 20:45:00 2021-03-02

Detail:

print(df[['start', 'end']].isna()
                          .groupby(df['id'])
                          .max())

    start    end
id              
a   False  False
b   False   True
c   False  False
d   False  False
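
Alternatively, a shorter sketch of the same result, with no separate mask: Series.max(skipna=False) returns NaT as soon as a group contains one.

# Sketch: disable NaN-skipping in the max itself, so a group with any NaT
# end aggregates to NaT; start has no missing values, so 'min' is safe.
new_df = (df.groupby('id')
            .agg({'start': 'min',
                  'end': lambda s: s.max(skipna=False)})
            .reset_index())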

If there are more grouping columns:

new_df = (
    df.groupby(['id', 'status'])
      .agg({'start': 'min', 'end': 'max'})
      .mask(df[['start', 'end']].isna()
                                .groupby([df['id'], df['status']])
                                .max())
      .reset_index()
)
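
For example, a sketch with a hypothetical status column added to the OP's MRE (status is not in the original data):

# Hypothetical 'status' column, one value per row of the MRE's df.
df['status'] = ['open', 'open', 'open', 'closed', 'closed', 'closed']

new_df = (
    df.groupby(['id', 'status'])
      .agg({'start': 'min', 'end': 'max'})
      .mask(df[['start', 'end']].isna()
                                .groupby([df['id'], df['status']])
                                .max())
      .reset_index()
)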