I have a pandas DataFrame that contains only dates:
year month day date
0 2019 01 17 01-2019
1 2019 01 17 01-2019
2 2019 01 18 01-2019
3 2019 01 19 01-2019
4 2019 01 20 01-2019
...
336 2021 12 31 12-2021
337 2022 03 02 03-2022
338 2022 03 05 03-2022
339 2022 05 25 05-2022
340 2022 06 09 06-2022
Using groupby, I get a count for the number of monthly occurrences as seen below:
gh = df.groupby([df['year'], df['month']]).agg({'count'}).reset_index()
print(gh)
>>> year month day date
count count
0 2019 01 10 10
1 2019 02 6 6
2 2019 03 8 8
3 2019 04 2 2
4 2019 05 7 7
5 2019 06 8 8
6 2019 07 10 10
7 2019 08 6 6
8 2019 09 6 6
9 2019 10 6 6
10 2019 11 4 4
11 2019 12 3 3
12 2020 01 12 12
13 2020 02 6 6
14 2020 03 22 22
15 2020 04 17 17
16 2020 05 4 4
17 2020 06 9 9
18 2020 07 6 6
19 2020 08 8 8
20 2020 09 4 4
21 2020 10 7 7
22 2020 11 15 15
23 2020 12 15 15
24 2021 01 18 18
25 2021 02 22 22
26 2021 03 15 15
27 2021 04 19 19
28 2021 05 16 16
29 2021 06 23 23
30 2021 07 19 19
31 2021 08 1 1
32 2021 12 3 3
33 2022 03 2 2
34 2022 05 1 1
35 2022 06 1 1
(The date column is only used for plotting.) My issue is that from 09-2021 onward some months have no occurrences, so they are missing from gh entirely. I want the gh DataFrame to include those months with a count of 0, so that the missing rows look something like:
31 2021 08 1 1
32 2021 09 0 0
33 2021 10 0 0
34 2021 11 0 0
35 2021 12 0 0
...
All the way through 06-2022.
I encounter errors when I try using gh.reindex(pd.period_range(gh.index[0], gh.index[-1], freq='M'))
from this solution, as the monthly indexes repeat. I also suspect that having only dates as my data, rather than actual variables, is part of the problem. I am trying to plot every month from 01/2019 to 06/2022, including the 0 counts for the months in late 2021 and early 2022. How do I make this work?
I know there are a lot of threads similar to this, but they all use actual data counts, and not just date counts.
Edit: Here is the output from df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341 entries, 0 to 340
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 341 non-null object
1 month 341 non-null object
2 day 341 non-null object
3 date 341 non-null object
dtypes: object(4)
memory usage: 10.8 KB
CodePudding user response:
Create a MultiIndex, then reindex after groupby and agg:
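# year and month are strings (object dtype) in df, so cast them to int first;
# otherwise they will not match the integer months in the MultiIndex.
df = df.astype({'year': int, 'month': int})
mi = pd.MultiIndex.from_product([df['year'].unique(), range(1, 13)],
                                names=['year', 'month'])
gh = (df.groupby([df['year'], df['month']]).agg({'count'})
        .reindex(mi, fill_value=0).reset_index()
        .droplevel(level=1, axis=1))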
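As a self-contained illustration of the same idea, here is a minimal sketch on a small made-up frame whose columns are all strings, as in the question's df.info() output. The sample values are hypothetical, and it uses .count() rather than .agg({'count'}) so the columns stay flat and no droplevel step is needed:

```python
import pandas as pd

# Hypothetical sample mimicking the question's frame: every column holds strings.
df = pd.DataFrame({
    "year":  ["2021", "2021", "2021", "2022", "2022"],
    "month": ["07", "08", "12", "03", "03"],
    "day":   ["30", "01", "31", "02", "05"],
})
df["date"] = df["month"] + "-" + df["year"]

# Cast the grouping keys to int so they match the integer months in the MultiIndex.
df = df.astype({"year": int, "month": int})

# Every (year, month) pair for the years present in the data.
mi = pd.MultiIndex.from_product(
    [sorted(df["year"].unique()), range(1, 13)], names=["year", "month"]
)

# Count rows per month, then add the missing months with a count of 0.
gh = (
    df.groupby(["year", "month"])[["day", "date"]]
    .count()
    .reindex(mi, fill_value=0)
    .reset_index()
)
print(gh[gh["year"] == 2021].tail(6))
```

Without the astype cast, the string months ('01', '02', ...) would not match range(1, 13), and the reindex would fill every row with 0.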
Output:
>>> gh
year month day date
0 2019 1 5 5
1 2019 2 0 0
2 2019 3 0 0
3 2019 4 0 0
4 2019 5 0 0
5 2019 6 0 0
6 2019 7 0 0
7 2019 8 0 0
8 2019 9 0 0
9 2019 10 0 0
10 2019 11 0 0
11 2019 12 0 0
12 2021 1 0 0
13 2021 2 0 0
14 2021 3 0 0
15 2021 4 0 0
16 2021 5 0 0
17 2021 6 0 0
18 2021 7 0 0
19 2021 8 0 0
20 2021 9 0 0
21 2021 10 0 0
22 2021 11 0 0
23 2021 12 1 1
24 2022 1 0 0
25 2022 2 0 0
26 2022 3 2 2
27 2022 4 0 0
28 2022 5 1 1
29 2022 6 1 1
30 2022 7 0 0
31 2022 8 0 0
32 2022 9 0 0
33 2022 10 0 0
34 2022 11 0 0
35 2022 12 0 0
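Note that the product of years and months pads the last year out to December, and it only covers years that appear in the data. If you only want the months between the first and last observation (01-2019 through 06-2022 in the question), one possible alternative is to count monthly periods and reindex against pd.period_range. This is a sketch on hypothetical sample data, not the approach used above:

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's frame (all strings).
df = pd.DataFrame({
    "year":  ["2021", "2021", "2021", "2022", "2022", "2022"],
    "month": ["07", "08", "12", "03", "03", "06"],
    "day":   ["30", "01", "31", "02", "05", "09"],
})

# Build one monthly Period per row from the year and month strings.
periods = pd.to_datetime(
    df["year"] + "-" + df["month"], format="%Y-%m"
).dt.to_period("M")

# Every month between the first and last observation, inclusive.
full_range = pd.period_range(periods.min(), periods.max(), freq="M")

# Count rows per month and fill the gaps with 0.
counts = (
    periods.value_counts()
    .reindex(full_range, fill_value=0)
    .rename_axis("period")
    .reset_index(name="count")
)

# Recreate the MM-YYYY label that the question uses for plotting.
counts["date"] = counts["period"].dt.strftime("%m-%Y")
print(counts)
```

Because the index here is a PeriodIndex that carries both year and month, the months no longer repeat across years, which is what made the original gh.reindex(pd.period_range(...)) attempt fail.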