Determine number of entrants and exits in each year in panel data-CodePudding

I have three data frames with the same variables for the same firms but corresponding to different years (2016, 2017, 2018). The ID variable corresponds to a firm ID.

df2016 = pd.DataFrame({"ID": [99,101,102,103,104], "A": [1,2,3,4,5], "B": [2,4,6,8,10], "year": [2016,2016,2016,2016,2016]})
    ID  A   B   year
0   99  1   2   2016
1   101 2   4   2016
2   102 3   6   2016
3   103 4   8   2016
4   104 5   10  2016

df2017 = pd.DataFrame({"ID": [100,101,102,104], "A": [5,6,7,8], "B": [9,11,13,15], "year": [2017,2017,2017,2017]})

    ID  A   B   year
0   100 5   9   2017
1   101 6   11  2017
2   102 7   13  2017
3   104 8   15  2017

df2018 = pd.DataFrame({"ID": [100,106], "A": [6,8], "B": [13,15], "year": [2018,2018]})
    ID  A   B   year
0   100 6   13  2018
1   106 8   15  2018

Goal

I want to determine the number of entering firms and exiting firms in each year.

Specifically, I would like:

A new variable called entry that equals 1 if the firm ID was not in the data last year and is the data this year; and zero otherwise.
Another variable called exit that equals 1 if the firm ID is in the data this year but is not in the data next year; and zero otherwise.
Edge cases of entering in the first year or exiting in the last year could be handled however is best.

Desired Output (something like this):

            A   B    enter  exit
ID  year                
99  2016    1.0 2.0     1   1
101 2016    2.0 4.0     1   0
102 2016    3.0 6.0     1   0
103 2016    4.0 8.0     1   1
104 2016    5.0 10.0    1   0
100 2017    5.0 9.0     1   0
101 2017    6.0 11.0    0   1
102 2017    7.0 13.0    0   1
104 2017    8.0 15.0    0   1
100 2018    6.0 13.0    0   0
106 2018    8.0 15.0    1   0

I start by concatenating the three years of data and setting the multi-index to (ID, year):

df = pd.concat([df2016, df2017, df2018])

df.set_index(["ID", "year"], inplace=True)
           A    B
ID  year        
99  2016    1   2
101 2016    2   4
102 2016    3   6
103 2016    4   8
104 2016    5   10
100 2017    5   9
101 2017    6   11
102 2017    7   13
104 2017    8   15
100 2018    6   13
106 2018    8   15

Not sure where to go next.

CodePudding user response：

You can use a merge on the shifted years, if the value is found then we have a 0, then fill the NaNs with 1:

dfs = [df2016, df2017, df2018]
df = pd.concat(dfs)

cols = ['ID', 'year']
df = (df
 .merge(df[cols].assign(year=df['year'].add(1), enter=0), on=cols, how='left')
 .merge(df[cols].assign(year=df['year'].sub(1), exit=0), on=cols, how='left')
 .fillna({'enter': 1, 'exit': 1}, downcast='infer')
)

output:

     ID  A   B  year  enter  exit
0    99  1   2  2016      1     1
1   101  2   4  2016      1     0
2   102  3   6  2016      1     0
3   103  4   8  2016      1     1
4   104  5  10  2016      1     0
5   100  5   9  2017      1     0
6   101  6  11  2017      0     1
7   102  7  13  2017      0     1
8   104  8  15  2017      0     1
9   100  6  13  2018      0     1
10  106  8  15  2018      1     1

CodePudding user response：

You can use:

df = pd.concat([df2016, df2017, df2018], ignore_index=True)
g = df.groupby('ID')['year']
df['entry'] = g.diff(1).ne(1).astype(int)
df['exit'] = g.diff(-1).ne(-1).astype(int)
print(df)

# Output
     ID  A   B  year  entry  exit
0    99  1   2  2016      1     1
1   101  2   4  2016      1     0
2   102  3   6  2016      1     0
3   103  4   8  2016      1     1
4   104  5  10  2016      1     0
5   100  5   9  2017      1     0
6   101  6  11  2017      0     1
7   102  7  13  2017      0     1
8   104  8  15  2017      0     1
9   100  6  13  2018      0     1
10  106  8  15  2018      1     1