Group by count dates and fill missing dates with 0-CodePudding

Please note that the real case the volume and the combination between state and country are much bigger, and not only to the value 'usa'

With the following dataframe:

import pandas as pd
data = pd.DataFrame({'state':['ny','sf','tx','ny','ny'],'country':['usa','usa','usa','usa','usa'],
                     'Date':['01/01/2020','01/01/2020','01/01/2020','01/02/2020','01/02/2020']})

I group the data to get the number of entries by city, country in a given date:

group_data = data.groupby(['state','country','Date']).size().to_frame().reset_index()

So I get:

|    | state   | country   | Date       |   0 |
|---:|:-------|:----------|:-----------|----:|
|  0 | ny     | usa       | 01/01/2020 |   1 |
|  1 | ny     | usa       | 01/02/2020 |   2 |
|  2 | sf     | usa       | 01/01/2020 |   1 |
|  3 | tx     | usa       | 01/01/2020 |   1 |

Ideally I would want for every entry in the dataframe, fill the date missing for the entire range max(data['Date']) and min(data['Date']) have the values filled for the subgroup state and country that's not present, so for example it should look like:

|    | state   | country   | Date       |   0 |
|---:|:-------|:----------|:-----------|----:|
|  0 | ny     | usa       | 01/01/2020 |   1 |
|  1 | ny     | usa       | 01/02/2020 |   2 |
|  2 | sf     | usa       | 01/01/2020 |   1 |
|  2 | sf     | usa       | 01/02/2020 |   0 |
|  3 | tx     | usa       | 01/01/2020 |   1 |
|  3 | tx     | usa       | 01/02/2020 |   0 |

Two entries for sf-usa and tx-usa have been added for the 2nd of Januarya 2020 with 0.

CodePudding user response：

Adding unstack and stack

out = data.groupby(['state','country','Date']).size().unstack(fill_value=0).stack().reset_index()
Out[276]: 
  state country        Date  0
0    ny     usa  01/01/2020  1
1    ny     usa  01/02/2020  2
2    sf     usa  01/01/2020  1
3    sf     usa  01/02/2020  0
4    tx     usa  01/01/2020  1
5    tx     usa  01/02/2020  0

CodePudding user response：

You can also use the complete function from pyjanitor, to exposes explicit missing values; it can also help for scenarios where there are duplicates (not relevant here, since groupby always returns uniques):

# pip install pyjanitor
import pandas as pd
import janitor as jn
(data.groupby(['state', 'country', 'Date'], as_index = False)
     .size()
     .complete('state', 'country', 'Date')
     .fillna(0)
     # to preserve data as integers
     .astype({'size':pd.Int64Dtype()})
)

  state country        Date  size
0    ny     usa  01/01/2020   1
1    ny     usa  01/02/2020   2
2    sf     usa  01/01/2020   1
3    sf     usa  01/02/2020   0
4    tx     usa  01/01/2020   1
5    tx     usa  01/02/2020   0