Pandas group by two fields, pick min date and next max date from other group

Time:04-06

I have the following dataframe:

import pandas as pd
import numpy as np
data = pd.DataFrame({'model_id': [1,1,2],
                     'brand': ['nike','nike','nike'],
                     'release_date': ['01/01/2021','01/02/2021','01/03/2021']})
data['release_date'] = pd.to_datetime(data['release_date'], format='%m/%d/%Y')

It looks like this:

|    |   model_id | brand   | release_date        |
|---:|-----------:|:--------|:--------------------|
|  0 |          1 | nike    | 2021-01-01 00:00:00 |
|  1 |          1 | nike    | 2021-01-02 00:00:00 |
|  2 |          2 | nike    | 2021-01-03 00:00:00 |

I want to group by model_id and brand and get the min and max date for each group. However, the max_date of a model must be the min_date of the next model of the same brand, or NaN if there is no next model (as is the case for model_id 2).

The expected output is:

|             | min_date            | max_date            |
|:------------|--------------------:|--------------------:|
| (1, 'nike') | 2021-01-01 00:00:00 | 2021-01-03 00:00:00 |
| (2, 'nike') | 2021-01-03 00:00:00 | NaN                 |

I have tried the following, but I cannot work out the step that picks the next date within the same brand:

(data.groupby(['model_id','brand'])
     .agg(min_date=('release_date', np.min), max_date=('release_date', np.max))
     .to_markdown()
 )

Note: the results should not change when the rows of the dataframe are in a different order. For example:

data = pd.DataFrame({'model_id':[2,1,2],'brand':['nike','nike','nike'],'release_date':['01/03/2021','01/02/2021','01/01/2021']})

(data
.groupby(['model_id','brand'])
.agg(min_date=('release_date', 'min'))
.assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
)

This yields a min_date that is later than max_date for model_id 1:

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 01/02/2021 | 01/01/2021 |
| (2, 'nike') | 01/01/2021 | nan        |

User response:

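Shifting max_date per group

Here max_date is defined as the min_date of the next model_id within the same brand. The groupby sorts the groups by model_id, so shift(-1) pulls each row's value from the row that follows it: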

(data
 .groupby(['model_id','brand'])
 .agg(min_date=('release_date', 'min'))
 .assign(max_date=lambda d: d.groupby('brand')['min_date'].shift(-1))
 #.astype(str).to_markdown() # uncomment for markdown
)

output:

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-03 |
| (2, 'nike') | 2021-01-03 | NaT        |

Previous answer

You need to mask max_date afterwards wherever it equals min_date (that is, for groups with a single date):

(data
 .groupby(['model_id','brand'])
 .agg(min_date=('release_date', 'min'), max_date=('release_date', 'max'))
 .assign(max_date=lambda d: d['max_date'].mask(d['max_date'].eq(d['min_date'])))
 #.astype(str).to_markdown() # uncomment for markdown
)

output (as markdown):

|             | min_date   | max_date   |
|:------------|:-----------|:-----------|
| (1, 'nike') | 2021-01-01 | 2021-01-02 |
| (2, 'nike') | 2021-01-03 | NaT        |
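
The shift-based approach above takes the "next" model in model_id order, which is why the reordered example in the question ends up with a min_date later than max_date. Below is a minimal sketch of one possible variant, assuming that "next" should mean the next model in chronological order of first release within a brand. The helper name next_release_per_brand is hypothetical, and the date format is assumed to be the '%m/%d/%Y' used in the question.

```python
import pandas as pd

def next_release_per_brand(df):
    """Hypothetical helper: min_date per (model_id, brand), with max_date set to
    the min_date of the chronologically next model of the same brand."""
    out = (df
           # parse the strings so dates compare as dates, not as text
           .assign(release_date=pd.to_datetime(df['release_date'], format='%m/%d/%Y'))
           .groupby(['model_id', 'brand'])
           .agg(min_date=('release_date', 'min'))
           # order models by first release within each brand before shifting
           .sort_values(['brand', 'min_date']))
    out['max_date'] = out.groupby('brand')['min_date'].shift(-1)
    return out

# the reordered frame from the question
data = pd.DataFrame({'model_id': [2, 1, 2],
                     'brand': ['nike', 'nike', 'nike'],
                     'release_date': ['01/03/2021', '01/02/2021', '01/01/2021']})
result = next_release_per_brand(data)
print(result)
```

With this ordering, model_id 2 (first released 2021-01-01) gets model_id 1's first release date (2021-01-02) as its max_date, and model_id 1 gets NaT because no later model exists. If "next" should instead follow model_id order, the sort_values step can be dropped, which gives the original answer.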