I have an unbalanced panel dataset, unbalanced.df, that looks as follows:
Date | ID | City | State | Quantity |
---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 500 |
2019-02 | 10001 | Los Angeles | CA | 995 |
2019-03 | 10001 | Los Angeles | CA | 943 |
2019-01 | 10002 | Houston | TX | 4330 |
2019-03 | 10002 | Houston | TX | 2340 |
2019-01 | 10003 | Sacramento | CA | 235 |
2019-02 | 10003 | Sacramento | CA | 239 |
2019-03 | 10003 | Sacramento | CA | 233 |
As you can see, Houston does not have 2019-02 as a Date. This happens throughout my panel data with different cities. I want to balance this panel by adding NA rows for the missing dates, so that the new panel data, balanced.df, looks like this:
Date | ID | City | State | Quantity |
---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 500 |
2019-02 | 10001 | Los Angeles | CA | 995 |
2019-03 | 10001 | Los Angeles | CA | 943 |
2019-01 | 10002 | Houston | TX | 4330 |
2019-02 | 10002 | Houston | TX | NaN |
2019-03 | 10002 | Houston | TX | 2340 |
2019-01 | 10003 | Sacramento | CA | 235 |
2019-02 | 10003 | Sacramento | CA | 239 |
2019-03 | 10003 | Sacramento | CA | 233 |
In this case, I have an absolute minimum date and an absolute maximum date, so I want every city to cover the same set of dates. How can I fill my panel with NaN rows so that every ID, City and State ends up with the same number of rows?
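For reproducibility, here is a minimal sketch that builds the example data above (the DataFrame name and keeping Date as a plain string column are just assumptions for illustration):
import pandas as pd

# minimal reconstruction of unbalanced.df from the table above
unbalanced_df = pd.DataFrame({
    'Date': ['2019-01', '2019-02', '2019-03', '2019-01', '2019-03',
             '2019-01', '2019-02', '2019-03'],
    'ID': [10001, 10001, 10001, 10002, 10002, 10003, 10003, 10003],
    'City': ['Los Angeles', 'Los Angeles', 'Los Angeles', 'Houston', 'Houston',
             'Sacramento', 'Sacramento', 'Sacramento'],
    'State': ['CA', 'CA', 'CA', 'TX', 'TX', 'CA', 'CA', 'CA'],
    'Quantity': [500, 995, 943, 4330, 2340, 235, 239, 233],
})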
CodePudding user response:
Try this: build the full set of Date/ID combinations with a MultiIndex and reindex onto it:
import pandas as pd

# df is the unbalanced frame (unbalanced.df above)
mapp = df.set_index('ID')[['City', 'State']].drop_duplicates()  # ID -> City/State lookup

# reindex onto the full Date x ID grid so every ID gets a row for every Date
full_idx = pd.MultiIndex.from_product([df['Date'].unique(), df['ID'].unique()],
                                      names=['Date', 'ID'])
df1 = df.set_index(['Date', 'ID']).reindex(full_idx).reset_index()

# fill City and State back in for the newly created NaN rows
df1 = df1.assign(City=df1['ID'].map(mapp['City']), State=df1['ID'].map(mapp['State']))
df1
Output:
Date ID City State Quantity
0 2019-01 10001 Los Angeles CA 500.0
1 2019-01 10002 Houston TX 4330.0
2 2019-01 10003 Sacramento CA 235.0
3 2019-02 10001 Los Angeles CA 995.0
4 2019-02 10002 Houston TX NaN
5 2019-02 10003 Sacramento CA 239.0
6 2019-03 10001 Los Angeles CA 943.0
7 2019-03 10002 Houston TX 2340.0
8 2019-03 10003 Sacramento CA 233.0
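Note that this output is ordered by Date first. If you prefer the ordering of balanced.df above (grouped by city), one extra sort does it, something like:
df1.sort_values(['ID', 'Date']).reset_index(drop=True)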
If you have a lot of columns, then you can use merge instead of assign:
# include Date on the right side, otherwise it is dropped from the result
df[['ID', 'City', 'State']].drop_duplicates().merge(df1[['Date', 'ID', 'Quantity']], on='ID')
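A rough sketch of how that generalises when there are many value columns; the static_cols/other_cols names here are just placeholders for illustration:
static_cols = ['ID', 'City', 'State']  # columns that are constant within each ID
other_cols = [c for c in df1.columns if c not in ('City', 'State')]  # Date, ID, Quantity, ...
balanced_df = df[static_cols].drop_duplicates().merge(df1[other_cols], on='ID')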