Home > Software design >  Pandas group and pick the min and the min of the next group
Pandas group and pick the min and the min of the next group

Time:04-07

I receive files with inventory of shoes and their brand, I need to know when was the first time a shoe was seen, and when was the last time it was seen for every brand. The last time it was seen should be computed as the date that we received a file and the shoe was not there.

There are three columns:

  1. model_id: The model of a shoe
  2. brand: The brand of a show
  3. release_date: When the catalog file was launched.

If I have a shoe from a brand that was seen on the release_date 1,2 but it's not anymore on 3, then the min_date will be 1 and the max_date will be 3.

If at a given time the release_date is the max_date available for that brand, then the min_date will still be the min value but the max_date will be Nat.

The min_date should never be bigger than max_date.

Having the input

data = pd.DataFrame({'model_id':[2,1,2,3],'brand':['nike','nike','nike','adidas'],'release_date':['01/03/2021','01/02/2021','01/01/2021','01/04/2021']}).sort_values('model_id')
'|    |   model_id | brand   | release_date   |
 |---:|-----------:|:--------|:---------------|
 |  1 |          1 | nike    | 01/02/2021     |
 |  0 |          2 | nike    | 01/03/2021     |
 |  2 |          2 | nike    | 01/01/2021     |
 |  3 |          3 | adidas  | 01/04/2021     |'

Should result in:

'|    |   model_id | brand   |     min_date   |     max_date   | 
 |---:|-----------:|:--------|:---------------|:---------------|
 |  1 |          1 | nike    | 01/02/2021     |    01/03/2021  |
 |  0 |          2 | nike    | 01/01/2021     |        NaT     |
 |  3 |          3 | adidas  | 01/04/2021     |        NaT     |'

The first row has min_date as the minimal date that appeared in a catalog and max_date because is picking the max(release_date) from the same brand.

The second row has min_date as the min(release_date) but max_date is NaT because it has the max(release_date) for its brand, in order to have a date there should be another max(release_date) for the same brand.

Ergo for the third row.

The logic I am having hard time implementing is for the max_date, whose has to be the value of the max(release_date) for every brand, unless that model_id has that very same release_date, then a missing value.

Tried everything in this question without any success because I didnt explain myself well enough

Pandas group by two fields, pick min date and next max date from other group

CodePudding user response:

This does what you ask for - first sorting by model_id and date and then groupby->first to get the first date, then using a series with the max date for each brand and checking if release_date is greater than it.

shoes = pd.DataFrame(
    {
        "model_id": [2, 1, 2, 3],
        "brand": ["nike", "nike", "nike", "adidas"],
        "release_date": ["01/03/2021", "01/02/2021", "01/01/2021", "01/04/2021"],
    }
)
shoes.release_date = pd.to_datetime(shoes.release_date)
shoes = (shoes
         .sort_values(["model_id", "release_date"])
         .groupby("model_id").first())

# find the "max_date" for each brand
max_dates_col = shoes.brand.map(
                    shoes.groupby("brand")["release_date"].max().to_dict())

shoes["max_date"] = max_dates_col.where(
    shoes.release_date < max_dates_col, other=pd.NaT
)
shoes = shoes.rename(columns={"release_date": "min_date"})
  • Related