Home > OS >  Adding observations for each id with missing year observations in unbalanced panel
Adding observations for each id with missing year observations in unbalanced panel

Time:06-03

Suppose we are working with an unbalanced panel that looks like the below dataframe.

df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'], 'Year': [2000, 2001, 2003, 2004, 2005, 2002, 2001, 2003], 'Var': [1, 4, 6, 8, 10, 12, 15, 17]})

print(df)
  id  Year  Var
0  1  2000    1
1  1  2001    4
2  1  2003    6
3  2  2004    8
4  2  2005   10
5  3  2002   12
6  4  2001   15
7  4  2003   17

The objective is to add rows with variable column equal to zero for ids that do not contain consecutive observations. For example, since id =1 has a row reported for 2000, 2001, and 2003, we need to aded the following row to the dataframe

df_append = pd.DataFrame({'id':[1], 'Year':[2002], 'Var':[0]})

We would also need to do this for id 4.

df_append2 = pd.DataFrame({'id':[4], 'Year':[2002], 'Var':[0]})

We would then retrieve the following dataframe:

  id  Year  Var
0  1  2000    1
1  1  2001    4
2  1  2002    0
3  1  2003    6
4  2  2004    8
5  2  2005   10
6  3  2002   12
7  4  2001   15
8  4  2002   0
9  4  2003   17

Was thinking of doing this in a forloop but did not know how to retrieve a dictionary of id and year made up of those that are missing observations.

CodePudding user response:

One option is with complete from pyjanitor:

# pip install pyjanitor
import janitor
import pandas as pd
new_year = {'Year':lambda year: range(year.min(), year.max() 1)}

df.complete(new_year, by = 'id', fill_value=0)
Out[79]:
  id  Year  Var
0  1  2000    1
1  1  2001    4
2  1  2002    0
3  1  2003    6
4  2  2004    8
5  2  2005   10
6  3  2002   12
7  4  2001   15
8  4  2002    0
9  4  2003   17

You can avoid pyjanitor and use a groupby:

(df
.set_index('Year')
.groupby('id')
.apply(lambda df: df.reindex(range(df.index.min(), 
                                   df.index.max()   1), 
                             fill_value=0))
.drop(columns='id')
.reset_index()
)

  id  Year  Var
0  1  2000    1
1  1  2001    4
2  1  2002    0
3  1  2003    6
4  2  2004    8
5  2  2005   10
6  3  2002   12
7  4  2001   15
8  4  2002    0
9  4  2003   17
  • Related