How to balance panel data by adding missing rows with no information?


I have an unbalanced panel dataset, unbalanced.df, that looks as follows:

Date     ID     City         State  Quantity
2019-01  10001  Los Angeles  CA     500
2019-02  10001  Los Angeles  CA     995
2019-03  10001  Los Angeles  CA     943
2019-01  10002  Houston      TX     4330
2019-03  10002  Houston      TX     2340
2019-01  10003  Sacramento   CA     235
2019-02  10003  Sacramento   CA     239
2019-03  10003  Sacramento   CA     233

As you can see, Houston has no row for 2019-02. This happens throughout my panel data with different cities.

I want to balance this panel by adding NA rows for the missing dates, so that the new panel data, balanced.df, looks like this:

Date     ID     City         State  Quantity
2019-01  10001  Los Angeles  CA     500
2019-02  10001  Los Angeles  CA     995
2019-03  10001  Los Angeles  CA     943
2019-01  10002  Houston      TX     4330
2019-02  10002  Houston      TX     NaN
2019-03  10002  Houston      TX     2340
2019-01  10003  Sacramento   CA     235
2019-02  10003  Sacramento   CA     239
2019-03  10003  Sacramento   CA     233

In this case, I have an absolute minimum date and an absolute maximum date, and I want to make sure that all cities cover the same dates. How can I fill my panel with NaN rows so that each ID, City and State ends up with the same number of rows?

CodePudding user response:

Try this: use a MultiIndex and reindex:

import pandas as pd

# Lookup of the attributes that belong to each ID, indexed by ID
mapp = df.set_index('ID')[['City', 'State']].drop_duplicates()

# Reindex on the full Date x ID grid: missing combinations become NaN rows
df1 = (df.set_index(['Date', 'ID'])
         .reindex(pd.MultiIndex.from_product([df['Date'].unique(),
                                              df['ID'].unique()],
                                             names=['Date', 'ID']))
         .reset_index())

# Fill City and State back in from the ID lookup
df1.assign(City=df1['ID'].map(mapp['City']), State=df1['ID'].map(mapp['State']))

Output:

      Date     ID         City State  Quantity
0  2019-01  10001  Los Angeles    CA     500.0
1  2019-01  10002      Houston    TX    4330.0
2  2019-01  10003   Sacramento    CA     235.0
3  2019-02  10001  Los Angeles    CA     995.0
4  2019-02  10002      Houston    TX       NaN
5  2019-02  10003   Sacramento    CA     239.0
6  2019-03  10001  Los Angeles    CA     943.0
7  2019-03  10002      Houston    TX    2340.0
8  2019-03  10003   Sacramento    CA     233.0
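Note that df['Date'].unique() only contains dates that appear for at least one city. Since you said you have an absolute minimum and maximum date, here is a minimal sketch of building the complete monthly range instead, assuming Date is a 'YYYY-MM' string as in the example (all_dates and full_index are just illustrative names):

import pandas as pd

# Full monthly range from the overall min to max Date, so months missing
# from every single city are still included (assumes 'YYYY-MM' strings)
all_dates = pd.period_range(df['Date'].min(), df['Date'].max(), freq='M').astype(str)

full_index = pd.MultiIndex.from_product([all_dates, df['ID'].unique()],
                                        names=['Date', 'ID'])
df1 = df.set_index(['Date', 'ID']).reindex(full_index).reset_index()

Mapping City and State back in (or the merge below) then works on this df1 exactly as above.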

If you have a lot of columns, then you can use merge instead of assign:

df[['ID', 'City', 'State']].drop_duplicates().merge(df1[['Date', 'ID', 'Quantity']], on='ID')
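The reindexed frame comes back ordered by Date and then ID, as in the output above. To reproduce the row order of balanced.df from the question (grouped by ID), a small follow-up, assuming the balanced result has been stored in a variable named balanced:

# Hypothetical: `balanced` holds the result of either approach above
balanced = balanced.sort_values(['ID', 'Date']).reset_index(drop=True)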