Suppose we are working with an unbalanced panel that looks like the below dataframe.
df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'], 'Year': [2000, 2001, 2003, 2004, 2005, 2002, 2001, 2003], 'Var': [1, 4, 6, 8, 10, 12, 15, 17]})
print(df)
id Year Var
0 1 2000 1
1 1 2001 4
2 1 2003 6
3 2 2004 8
4 2 2005 10
5 3 2002 12
6 4 2001 15
7 4 2003 17
The objective is to add rows with variable column equal to zero for ids that do not contain consecutive observations. For example, since id =1 has a row reported for 2000, 2001, and 2003, we need to aded the following row to the dataframe
df_append = pd.DataFrame({'id':[1], 'Year':[2002], 'Var':[0]})
We would also need to do this for id 4.
df_append2 = pd.DataFrame({'id':[4], 'Year':[2002], 'Var':[0]})
We would then retrieve the following dataframe:
id Year Var
0 1 2000 1
1 1 2001 4
2 1 2002 0
3 1 2003 6
4 2 2004 8
5 2 2005 10
6 3 2002 12
7 4 2001 15
8 4 2002 0
9 4 2003 17
Was thinking of doing this in a forloop but did not know how to retrieve a dictionary of id and year made up of those that are missing observations.
CodePudding user response:
One option is with complete
from pyjanitor
:
# pip install pyjanitor
import janitor
import pandas as pd
new_year = {'Year':lambda year: range(year.min(), year.max() 1)}
df.complete(new_year, by = 'id', fill_value=0)
Out[79]:
id Year Var
0 1 2000 1
1 1 2001 4
2 1 2002 0
3 1 2003 6
4 2 2004 8
5 2 2005 10
6 3 2002 12
7 4 2001 15
8 4 2002 0
9 4 2003 17
You can avoid pyjanitor and use a groupby:
(df
.set_index('Year')
.groupby('id')
.apply(lambda df: df.reindex(range(df.index.min(),
df.index.max() 1),
fill_value=0))
.drop(columns='id')
.reset_index()
)
id Year Var
0 1 2000 1
1 1 2001 4
2 1 2002 0
3 1 2003 6
4 2 2004 8
5 2 2005 10
6 3 2002 12
7 4 2001 15
8 4 2002 0
9 4 2003 17