Pandas group by two fields, pick min date and next max date from other group

Having the following dataframe:

import pandas as pd
import numpy as np
data = pd.DataFrame({'model_id': [1,1,2],
                     'brand': ['nike','nike','nike'],
                     'release_date': ['01/01/2021','01/02/2021','01/03/2021']})
data['release_date'] = pd.to_datetime(data['release_date'], format='%m/%d/%Y')

Looks like:

|    |   model_id | brand   | release_date        |
|---:|-----------:|:--------|:--------------------|
|  0 |          1 | nike    | 2021-01-01 00:00:00 |
|  1 |          1 | nike    | 2021-01-02 00:00:00 |
|  2 |          2 | nike    | 2021-01-03 00:00:00 |

I want to group by model_id and brand to get the min and max date, but the max_date of a model must be the min_date of the next model of the same brand, and NaN if there is none (the case for model_id 2).

The expected output is:

|             |            min_date |            max_date |
|:------------|--------------------:|--------------------:|
| (1, 'nike') | 2021-01-01 00:00:00 | 2021-01-03 00:00:00 |
| (2, 'nike') | 2021-01-03 00:00:00 |                 NaN |

I have tried running this, but I cannot work out the step of picking the next date of the same brand:

(data.groupby(['model_id','brand'])
     .agg(min_date=('release_date', np.min), max_date=('release_date', np.max))
     .to_markdown()
 )

Note: the results should not vary when the dataframe is in a different order. For example, with this data:

data = pd.DataFrame({'model_id':[2,1,2],'brand':['nike','nike','nike'],'release_date':['01/03/2021','01/02/2021','01/01/2021']})

(data
 .groupby(['model_id','brand'])
 .agg(min_date=('release_date', 'min'))
 .assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
)

This yields a min_date later than max_date for model_id 1:

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 01/02/2021 | 01/01/2021 |
| (2, 'nike') | 01/01/2021 | nan        |

CodePudding user response：

shifting max_date per group

Here max_date is defined as the min_date of the next model_id within the same brand (shift(-1) pulls the following row's value up):

(data
 .groupby(['model_id','brand'])
 .agg(min_date=('release_date', 'min'))
 .assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
 #.astype(str).to_markdown() # uncomment for markdown
)

output:

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-03 |
| (2, 'nike') | 2021-01-03 | NaT        |
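
If the model_id order does not match the release order (as in the reordered frame from the question), the shift above follows model_id and can return a max_date earlier than min_date. One possible variant, a sketch rather than the definitive fix, is to sort by min_date within each brand before shifting, so that "next" means the next release in time. It assumes the dates have been parsed with pd.to_datetime and that ordering by date is what is wanted:

```python
import pandas as pd

# Reordered frame from the question, with dates parsed as datetimes.
data = pd.DataFrame({
    'model_id': [2, 1, 2],
    'brand': ['nike', 'nike', 'nike'],
    'release_date': ['01/03/2021', '01/02/2021', '01/01/2021'],
})
data['release_date'] = pd.to_datetime(data['release_date'], format='%m/%d/%Y')

result = (
    data
    .groupby(['model_id', 'brand'])
    .agg(min_date=('release_date', 'min'))
    # Order models chronologically within each brand before shifting,
    # so the shift follows release dates rather than model_id values.
    .sort_values(['brand', 'min_date'])
    .assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
)
print(result)
```

With this data, model_id 2 (first release 2021-01-01) comes first and gets max_date 2021-01-02, while model_id 1 has no later model and gets NaT.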

previous answer

You need to mask the data afterwards:

(data
 .groupby(['model_id','brand'])
 .agg(min_date=('release_date', 'min'), max_date=('release_date', 'max'))
 .assign(max_date=lambda d: d['max_date'].mask(d['max_date'].eq(d['min_date'])))
 #.astype(str).to_markdown() # uncomment for markdown
)

output (as markdown):

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-02 |
| (2, 'nike') | 2021-01-03 | NaT        |
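
To see the masking step in isolation, here is a small sketch on a hand-built frame (the name `summary` and its values are made up for illustration, standing in for the groupby/agg result). Series.mask replaces entries where the condition is True with a missing value, so a group whose max equals its min ends up with NaT:

```python
import pandas as pd

# Hypothetical aggregated frame, standing in for the groupby/agg result.
summary = pd.DataFrame({
    'min_date': pd.to_datetime(['2021-01-01', '2021-01-03']),
    'max_date': pd.to_datetime(['2021-01-02', '2021-01-03']),
})

# Where max_date equals min_date the group has a single date: blank it out.
masked = summary['max_date'].mask(summary['max_date'].eq(summary['min_date']))
print(masked)
```

Note that this approach keeps each group's own maximum (2021-01-02 for model_id 1), not the next model's min_date (2021-01-03) shown in the expected output of the question.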