I have an unbalanced panel dataset, unbalanced.df, that looks as follows:
Date | ID | City | State | Quantity |
---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 500 |
2019-02 | 10001 | Los Angeles | CA | 995 |
2019-03 | 10001 | Los Angeles | CA | 943 |
2019-01 | 10002 | Houston | TX | 4330 |
2019-03 | 10002 | Houston | TX | 2340 |
2019-01 | 10003 | Sacramento | CA | 235 |
2019-02 | 10003 | Sacramento | CA | 239 |
2019-03 | 10003 | Sacramento | CA | 233 |
As you can see, Houston does not have 2019-02 as a Date. This happens throughout my panel data with different cities. I want to balance this panel by adding NA rows for the missing dates, so that the new panel data, balanced.df, looks like this:
Date | ID | City | State | Quantity |
---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 500 |
2019-02 | 10001 | Los Angeles | CA | 995 |
2019-03 | 10001 | Los Angeles | CA | 943 |
2019-01 | 10002 | Houston | TX | 4330 |
2019-02 | 10002 | Houston | TX | NaN |
2019-03 | 10002 | Houston | TX | 2340 |
2019-01 | 10003 | Sacramento | CA | 235 |
2019-02 | 10003 | Sacramento | CA | 239 |
2019-03 | 10003 | Sacramento | CA | 233 |
In this case, I have an absolute minimum date and an absolute maximum date, so I want every city to cover the same set of dates. How can I fill my panel with NaN rows so that every ID, City and State ends up with the same number of rows?
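For reproducibility, here is a minimal sketch that builds the example data above (the DataFrame name and keeping Date as a plain string column are just assumptions for illustration):
import pandas as pd

# minimal reconstruction of unbalanced.df from the table above
unbalanced_df = pd.DataFrame({
    'Date': ['2019-01', '2019-02', '2019-03', '2019-01', '2019-03',
             '2019-01', '2019-02', '2019-03'],
    'ID': [10001, 10001, 10001, 10002, 10002, 10003, 10003, 10003],
    'City': ['Los Angeles', 'Los Angeles', 'Los Angeles', 'Houston', 'Houston',
             'Sacramento', 'Sacramento', 'Sacramento'],
    'State': ['CA', 'CA', 'CA', 'TX', 'TX', 'CA', 'CA', 'CA'],
    'Quantity': [500, 995, 943, 4330, 2340, 235, 239, 233],
})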
CodePudding user response:
Try this: build the full set of Date/ID combinations with a MultiIndex and reindex onto it:
import pandas as pd

# df is the unbalanced frame (unbalanced.df above)
mapp = df.set_index('ID')[['City', 'State']].drop_duplicates()  # ID -> City/State lookup

# reindex onto the full Date x ID grid so every ID gets a row for every Date
full_idx = pd.MultiIndex.from_product([df['Date'].unique(), df['ID'].unique()],
                                      names=['Date', 'ID'])
df1 = df.set_index(['Date', 'ID']).reindex(full_idx).reset_index()

# fill City and State back in for the newly created NaN rows
df1 = df1.assign(City=df1['ID'].map(mapp['City']), State=df1['ID'].map(mapp['State']))
df1
Output:
Date ID City State Quantity
0 2019-01 10001 Los Angeles CA 500.0
1 2019-01 10002 Houston TX 4330.0
2 2019-01 10003 Sacramento CA 235.0
3 2019-02 10001 Los Angeles CA 995.0
4 2019-02 10002 Houston TX NaN
5 2019-02 10003 Sacramento CA 239.0
6 2019-03 10001 Los Angeles CA 943.0
7 2019-03 10002 Houston TX 2340.0
8 2019-03 10003 Sacramento CA 233.0
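Note that this output is ordered by Date first. If you prefer the ordering of balanced.df above (grouped by city), one extra sort does it, something like:
df1.sort_values(['ID', 'Date']).reset_index(drop=True)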
If you have a lot of columns, then you can use merge instead of assign:
# include Date on the right side, otherwise it is dropped from the result
df[['ID', 'City', 'State']].drop_duplicates().merge(df1[['Date', 'ID', 'Quantity']], on='ID')
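A rough sketch of how that generalises when there are many value columns; the static_cols/other_cols names here are just placeholders for illustration:
static_cols = ['ID', 'City', 'State']  # columns that are constant within each ID
other_cols = [c for c in df1.columns if c not in ('City', 'State')]  # Date, ID, Quantity, ...
balanced_df = df[static_cols].drop_duplicates().merge(df1[other_cols], on='ID')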