Please note that the real case the volume and the combination between state and country are much bigger, and not only to the value 'usa'
With the following dataframe:
import pandas as pd
data = pd.DataFrame({'state':['ny','sf','tx','ny','ny'],'country':['usa','usa','usa','usa','usa'],
'Date':['01/01/2020','01/01/2020','01/01/2020','01/02/2020','01/02/2020']})
I group the data to get the number of entries by city, country in a given date:
group_data = data.groupby(['state','country','Date']).size().to_frame().reset_index()
So I get:
| | state | country | Date | 0 |
|---:|:-------|:----------|:-----------|----:|
| 0 | ny | usa | 01/01/2020 | 1 |
| 1 | ny | usa | 01/02/2020 | 2 |
| 2 | sf | usa | 01/01/2020 | 1 |
| 3 | tx | usa | 01/01/2020 | 1 |
Ideally I would want for every entry in the dataframe, fill the date missing for the entire range max(data['Date']) and min(data['Date']) have the values filled for the subgroup state and country that's not present, so for example it should look like:
| | state | country | Date | 0 |
|---:|:-------|:----------|:-----------|----:|
| 0 | ny | usa | 01/01/2020 | 1 |
| 1 | ny | usa | 01/02/2020 | 2 |
| 2 | sf | usa | 01/01/2020 | 1 |
| 2 | sf | usa | 01/02/2020 | 0 |
| 3 | tx | usa | 01/01/2020 | 1 |
| 3 | tx | usa | 01/02/2020 | 0 |
Two entries for sf-usa and tx-usa have been added for the 2nd of Januarya 2020 with 0.
CodePudding user response:
Adding unstack
and stack
out = data.groupby(['state','country','Date']).size().unstack(fill_value=0).stack().reset_index()
Out[276]:
state country Date 0
0 ny usa 01/01/2020 1
1 ny usa 01/02/2020 2
2 sf usa 01/01/2020 1
3 sf usa 01/02/2020 0
4 tx usa 01/01/2020 1
5 tx usa 01/02/2020 0
CodePudding user response:
You can also use the complete
function from pyjanitor
, to exposes explicit missing values; it can also help for scenarios where there are duplicates (not relevant here, since groupby always returns uniques):
# pip install pyjanitor
import pandas as pd
import janitor as jn
(data.groupby(['state', 'country', 'Date'], as_index = False)
.size()
.complete('state', 'country', 'Date')
.fillna(0)
# to preserve data as integers
.astype({'size':pd.Int64Dtype()})
)
state country Date size
0 ny usa 01/01/2020 1
1 ny usa 01/02/2020 2
2 sf usa 01/01/2020 1
3 sf usa 01/02/2020 0
4 tx usa 01/01/2020 1
5 tx usa 01/02/2020 0