I have a pandas DataFrame with two columns: id
and processing_date
.
The latter being the date an item (id) was processed.
import pandas as pd
# df
id processed_date
324 2016-07-08
A550 2016-07-09
79 2016-08-10
C295 2016-08-10
413 2016-08-11
...
111 2021-11-08
709 2021-11-08
I want to plot both a graph that shows the amount of items that were processed each months and a cumulative graph "over the months". As I have 5 years and 4 months of data, I must have 64 entries and 64 data points to plot as a bar or line plot.
This is what I get from here but it's not working as expected:
df['date'] = pd.to_datetime(df['processed_date']) # needed by the nature of the data
df.set_index('date')
df = df.groupby('date')['id'].count() # <- this will stack items per day
df = df.groupby(df.index.month)['id'].count() # <- this will stack items per 12 months, but I have 5 years and 4 months of data, hence 64 different months, not 12.
How could I achieve that?
Ideal output:
# df
nb_items_processed cum_sum year_month
2 2 2016-07
3 5 2016-08
...
2 xxx 2021-11
CodePudding user response:
Start out with groupby().size()
for the monthly count, then cumsum without groupby:
out = df.groupby(pd.Grouper(key='processed_date', freq='M')).size().reset_index(name='nb_items_processed')
out['cum_sum'] = out['nb_items_processed'].cumsum()