Panel Data - dealing with missing year when creating lead and lag variables

I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general look of panel data is as follows:

import pandas as pd

df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001,2002,2004,2005] + [2000,2002,2003] + [2001,2002,2003,2005],
                   'val1': [1,2,3,4,5,6,7,8,9,10,11],
                   'val2': [2,5,7,11,13,17,19,23,29,31,37]})

   name  year  val1  val2
0     a  2001     1     2
1     a  2002     2     5
2     a  2004     3     7
3     a  2005     4    11
4     b  2000     5    13
5     b  2002     6    17
6     b  2003     7    19
7     c  2001     8    23
8     c  2002     9    29
9     c  2003    10    31
10    c  2005    11    37

Now I want to create lead and lag variables grouped by name. Using:

df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)

This simply shifts each value up or down by one row, which is not what I want. I want to shift relative to year, so that a missing year yields NaN. My expected output:

   name  year  val1  val2  val1_lag  val1_lead
0     a  2001     1     2       NaN        2.0
1     a  2002     2     5       1.0        NaN
2     a  2004     3     7       NaN        4.0
3     a  2005     4    11       3.0        NaN
4     b  2000     5    13       NaN        NaN
5     b  2002     6    17       NaN        7.0
6     b  2003     7    19       6.0        NaN
7     c  2001     8    23       NaN        9.0
8     c  2002     9    29       8.0       10.0
9     c  2003    10    31       9.0        NaN
10    c  2005    11    37       NaN        NaN

My current workaround is to fill in the missing years with:

df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()

Then I use a normal shift. However, my data is quite large, and reindexing often triples its size, which is not very efficient.

I am looking for a better approach for this scenario.

CodePudding user response：

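The solution is to create check columns that flag whether the previous row (for the lag) or the next row (for the lead) within each group is the adjacent year. Set the check column to 1.0 where it is and NaN where it is not, then multiply it by your normal groupby shift.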

df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None

df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None

To create the lag variable (the lead works the same way with shift(-1) and yearlead):

%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']

You can time it against the merge method from the other answer; the shift-based version is much more efficient:

%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year + 1'), how='left')['val1']
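
As a variation on the same idea, the 1.0/NaN check column can be replaced by a boolean mask passed to `.where`, which avoids the multiplication and the extra helper columns. This is only a minimal sketch under two assumptions that the answer above also relies on: the frame is sorted by `name` and `year`, and each (`name`, `year`) pair appears at most once. The names `has_prev_year` and `has_next_year` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a']*4 + ['b']*3 + ['c']*4,
    'year': [2001, 2002, 2004, 2005] + [2000, 2002, 2003] + [2001, 2002, 2003, 2005],
    'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})

# Assumption: rows must be ordered by year within each name for shift to be meaningful
df = df.sort_values(['name', 'year']).reset_index(drop=True)
g = df.groupby('name')

# True only where the neighbouring row of the same group is exactly one year away
has_prev_year = g['year'].shift(1).eq(df['year'] - 1)
has_next_year = g['year'].shift(-1).eq(df['year'] + 1)

# Keep the shifted value where the years are consecutive, NaN elsewhere
lag = g['val1'].shift(1).where(has_prev_year)
lead = g['val1'].shift(-1).where(has_next_year)

df['val1_lag'] = lag
df['val1_lead'] = lead
print(df)
```

On the sample data this should reproduce the expected output from the question. Like the check-column version, it only covers a shift of one year.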

CodePudding user response：

Don't use shift but a merge with the year ± 1:

df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year + 1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year - 1'), how='left')['val1']

Output:

   name  year  val1  val2  val1_lag  val1_lead
0     a  2001     1     2       NaN        2.0
1     a  2002     2     5       1.0        NaN
2     a  2004     3     7       NaN        4.0
3     a  2005     4    11       3.0        NaN
4     b  2000     5    13       NaN        NaN
5     b  2002     6    17       NaN        7.0
6     b  2003     7    19       6.0        NaN
7     c  2001     8    23       NaN        9.0
8     c  2002     9    29       8.0       10.0
9     c  2003    10    31       9.0        NaN
10    c  2005    11    37       NaN        NaN
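
If lags of more than one year are needed, the merge can be wrapped in a small helper. The sketch below is illustrative only: `shift_by_year` and its parameters are hypothetical names, not part of pandas, and it assumes each (`name`, `year`) pair is unique (`validate='many_to_one'` makes the merge raise if it is not).

```python
import pandas as pd

def shift_by_year(df, col, k, key='name', time='year'):
    """Return `col` taken from the row of the same `key` that is k years
    earlier (k > 0 gives a lag, k < 0 gives a lead); NaN if that year is missing."""
    right = df[[key, time, col]].copy()
    # A value observed in year t is relabelled as year t + k,
    # so it lines up with the row that should receive it.
    right[time] = right[time] + k
    merged = df[[key, time]].merge(right, on=[key, time], how='left',
                                   validate='many_to_one')
    # The left merge keeps the row order of df, so reuse df's index
    return pd.Series(merged[col].to_numpy(), index=df.index)

df = pd.DataFrame({
    'name': ['a']*4 + ['b']*3 + ['c']*4,
    'year': [2001, 2002, 2004, 2005] + [2000, 2002, 2003] + [2001, 2002, 2003, 2005],
    'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})

df['val1_lag'] = shift_by_year(df, 'val1', 1)
df['val1_lead'] = shift_by_year(df, 'val1', -1)
df['val1_lag2'] = shift_by_year(df, 'val1', 2)   # value from two years earlier
print(df)
```

Because the match is on the year itself rather than on row position, this does not require the data to be sorted, and it works for any offset: with gaps in the panel, the row for year t-2 may be one or two rows back, which a fixed row shift cannot express.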