I receive files with inventory of shoes and their brand, I need to know when was the first time a shoe was seen, and when was the last time it was seen for every brand. The last time it was seen should be computed as the date that we received a file and the shoe was not there.
There are three columns:
- model_id: The model of a shoe
- brand: The brand of a show
- release_date: When the catalog file was launched.
If I have a shoe from a brand that was seen on the release_date 1,2 but it's not anymore on 3, then the min_date will be 1 and the max_date will be 3.
If at a given time the release_date is the max_date available for that brand, then the min_date will still be the min value but the max_date will be Nat.
The min_date should never be bigger than max_date.
Having the input
data = pd.DataFrame({'model_id':[2,1,2,3],'brand':['nike','nike','nike','adidas'],'release_date':['01/03/2021','01/02/2021','01/01/2021','01/04/2021']}).sort_values('model_id')
'| | model_id | brand | release_date |
|---:|-----------:|:--------|:---------------|
| 1 | 1 | nike | 01/02/2021 |
| 0 | 2 | nike | 01/03/2021 |
| 2 | 2 | nike | 01/01/2021 |
| 3 | 3 | adidas | 01/04/2021 |'
Should result in:
'| | model_id | brand | min_date | max_date |
|---:|-----------:|:--------|:---------------|:---------------|
| 1 | 1 | nike | 01/02/2021 | 01/03/2021 |
| 0 | 2 | nike | 01/01/2021 | NaT |
| 3 | 3 | adidas | 01/04/2021 | NaT |'
The first row has min_date as the minimal date that appeared in a catalog and max_date because is picking the max(release_date) from the same brand.
The second row has min_date as the min(release_date) but max_date is NaT because it has the max(release_date) for its brand, in order to have a date there should be another max(release_date) for the same brand.
Ergo for the third row.
The logic I am having hard time implementing is for the max_date, whose has to be the value of the max(release_date) for every brand, unless that model_id has that very same release_date, then a missing value.
Tried everything in this question without any success because I didnt explain myself well enough
Pandas group by two fields, pick min date and next max date from other group
CodePudding user response:
This does what you ask for - first sorting by model_id and date and then groupby->first
to get the first date, then using a series with the max date for each brand and checking if release_date is greater than it.
shoes = pd.DataFrame(
{
"model_id": [2, 1, 2, 3],
"brand": ["nike", "nike", "nike", "adidas"],
"release_date": ["01/03/2021", "01/02/2021", "01/01/2021", "01/04/2021"],
}
)
shoes.release_date = pd.to_datetime(shoes.release_date)
shoes = (shoes
.sort_values(["model_id", "release_date"])
.groupby("model_id").first())
# find the "max_date" for each brand
max_dates_col = shoes.brand.map(
shoes.groupby("brand")["release_date"].max().to_dict())
shoes["max_date"] = max_dates_col.where(
shoes.release_date < max_dates_col, other=pd.NaT
)
shoes = shoes.rename(columns={"release_date": "min_date"})