Suppose I have next df N03_zero
(date_code
is already datetime
):
item_code date_code
8028558104973 2022-01-01
8028558104973 2022-01-02
8028558104973 2022-01-03
8028558104973 2022-01-06
8028558104973 2022-01-07
7622300443269 2022-01-01
7622300443269 2022-01-10
7622300443269 2022-01-11
513082 2022-01-01
513082 2022-01-02
513082 2022-01-03
Millions of rows with date_code
assigned to some item_code
.
I am trying to get the number of days of each continuous period for each item_code
, all other similar questions doesn't helped me.
The expected df should be:
item_code continuous_days
8028558104973 3
8028558104973 2
7622300443269 1
7622300443269 2
513082 3
Once days sequence breaks, it should count days in this sequence and then start to count again.
The aim is, to able to get then the dataframe with count
, min
, max
, and mean
for each item_code
.
Like this:
item_code no. periods min max mean
8028558104973 2 2 3 2.5
7622300443269 2 1 2 1.5
513082 1 3 3 3
Any suggestions?
CodePudding user response:
For consecutive days compare difference by Series.diff
in days by Series.dt.days
for not equal 1
by Series.ne
with cumulative sum by Series.cumsum
and then use GroupBy.size
, remove second level by DataFrame.droplevel
and create DataFrame
:
df['date_code'] = pd.to_datetime(df['date_code'])
df1= (df.groupby(['item_code',df['date_code'].diff().dt.days.ne(1).cumsum()], sort=False)
.size()
.droplevel(1)
.reset_index(name='continuous_days'))
print (df1)
item_code continuous_days
0 8028558104973 3
1 8028558104973 2
2 7622300443269 1
3 7622300443269 2
4 513082 3
And then aggregate values by named aggregations by GroupBy.agg
:
df2 = (df1.groupby('item_code', sort=False, as_index=False)
.agg(**{'no. periods': ('continuous_days','size'),
'min':('continuous_days','min'),
'max':('continuous_days','max'),
'mean':('continuous_days','mean')}))
print (df2)
item_code no. periods min max mean
0 8028558104973 2 2 3 2.5
1 7622300443269 2 1 2 1.5
2 513082 1 3 3 3.0