I have the following dataframe:
import pandas as pd
import numpy as np
data = pd.DataFrame({'model_id': [1,1,2],
'brand': ['nike','nike','nike'],
'release_date': ['01/01/2021','01/02/2021','01/03/2021']})
data['release_date'] = pd.to_datetime(data['release_date'], format='%m/%d/%Y')
It looks like:
| | model_id | brand | release_date |
|---:|-----------:|:--------|:--------------------|
| 0 | 1 | nike | 2021-01-01 00:00:00 |
| 1 | 1 | nike | 2021-01-02 00:00:00 |
| 2 | 2 | nike | 2021-01-03 00:00:00 |
I want to group by model_id and brand to get the min and max date, but the max_date of a model must be the min_date of the next model of the same brand, and NaN if there is none (the case for model_id 2).
The expected output is:
| | min_date | max_date |
|:------------|--------------------:|---------------------:|
| (1, 'nike') | 2021-01-01 00:00:00 | 2021-01-03 00:00:00 |
| (2, 'nike') | 2021-01-03 00:00:00 | NaN |
I have tried running the following, but I cannot work out the step that picks the next date of the same brand:
(data.groupby(['model_id','brand'])
.agg(min_date=('release_date', 'min'), max_date=('release_date', 'max'))
.to_markdown()
)
Note: the results should not depend on the row order of the dataframe. For example:
data = pd.DataFrame({'model_id':[2,1,2],'brand':['nike','nike','nike'],'release_date':['01/03/2021','01/02/2021','01/01/2021']})
(data
.groupby(['model_id','brand'])
.agg(min_date=('release_date', 'min'))
.assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
)
This yields a min_date later than max_date:
| | min_date | max_date |
|:------------|:-----------|:-----------|
| (1, 'nike') | 01/02/2021 | 01/01/2021 |
| (2, 'nike') | 01/01/2021 | nan |
CodePudding user response:
Shifting max_date per group
Here max_date is defined as the min_date of the next model_id of the same brand (shift(-1) pulls the value from the following row within each brand):
(data
.groupby(['model_id','brand'])
.agg(min_date=('release_date', 'min'))
.assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
#.astype(str).to_markdown() # uncomment for markdown
)
output:
| | min_date | max_date |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-03 |
| (2, 'nike') | 2021-01-03 | NaT |
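The shift above relies on the grouped rows already being in date order, which is not the case for the reordered frame in the question. A possible order-independent variant is sketched below. It assumes release_date is parsed with pd.to_datetime (the reordered frame keeps the dates as strings) and that "next model" means the next one by min_date within a brand rather than the next model_id. The sort_values step is an addition for illustration and is not part of the code above.

```python
import pandas as pd

# Reordered frame from the question; here model 2 has the earliest date.
data = pd.DataFrame({
    'model_id': [2, 1, 2],
    'brand': ['nike', 'nike', 'nike'],
    'release_date': ['01/03/2021', '01/02/2021', '01/01/2021'],
})
# Parse the strings so that min() and the sort compare real dates.
data['release_date'] = pd.to_datetime(data['release_date'], format='%m/%d/%Y')

out = (data
       .groupby(['model_id', 'brand'])
       .agg(min_date=('release_date', 'min'))
       # Assumption: order models chronologically before shifting.
       .sort_values('min_date')
       # Within each brand, take the min_date of the following row.
       .assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
)
print(out)
```

For this input, model 2 gets min_date 2021-01-01 and max_date 2021-01-02, and model 1 gets min_date 2021-01-02 and max_date NaT, so min_date is never later than max_date.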
Previous answer
If max_date should instead be the group's own latest date, mask it afterwards where it equals min_date:
(data
.groupby(['model_id','brand'])
.agg(min_date=('release_date', 'min'), max_date=('release_date', 'max'))
.assign(max_date=lambda d: d['max_date'].mask(d['max_date'].eq(d['min_date'])))
#.astype(str).to_markdown() # uncomment for markdown
)
output (as markdown):
| | min_date | max_date |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-02 |
| (2, 'nike') | 2021-01-03 | NaT |