I have three data frames with the same variables for the same firms but corresponding to different years (2016, 2017, 2018). The ID variable corresponds to a firm ID.
df2016 = pd.DataFrame({"ID": [99,101,102,103,104], "A": [1,2,3,4,5], "B": [2,4,6,8,10], "year": [2016,2016,2016,2016,2016]})
ID A B year
0 99 1 2 2016
1 101 2 4 2016
2 102 3 6 2016
3 103 4 8 2016
4 104 5 10 2016
df2017 = pd.DataFrame({"ID": [100,101,102,104], "A": [5,6,7,8], "B": [9,11,13,15], "year": [2017,2017,2017,2017]})
ID A B year
0 100 5 9 2017
1 101 6 11 2017
2 102 7 13 2017
3 104 8 15 2017
df2018 = pd.DataFrame({"ID": [100,106], "A": [6,8], "B": [13,15], "year": [2018,2018]})
ID A B year
0 100 6 13 2018
1 106 8 15 2018
Goal
I want to determine the number of entering firms and exiting firms in each year.
Specifically, I would like:
- A new variable called
entry
that equals 1 if the firm ID was not in the data last year and is the data this year; and zero otherwise. - Another variable called
exit
that equals 1 if the firm ID is in the data this year but is not in the data next year; and zero otherwise. - Edge cases of entering in the first year or exiting in the last year could be handled however is best.
Desired Output (something like this):
A B enter exit
ID year
99 2016 1.0 2.0 1 1
101 2016 2.0 4.0 1 0
102 2016 3.0 6.0 1 0
103 2016 4.0 8.0 1 1
104 2016 5.0 10.0 1 0
100 2017 5.0 9.0 1 0
101 2017 6.0 11.0 0 1
102 2017 7.0 13.0 0 1
104 2017 8.0 15.0 0 1
100 2018 6.0 13.0 0 0
106 2018 8.0 15.0 1 0
I start by concatenating the three years of data and setting the multi-index to (ID, year):
df = pd.concat([df2016, df2017, df2018])
df.set_index(["ID", "year"], inplace=True)
A B
ID year
99 2016 1 2
101 2016 2 4
102 2016 3 6
103 2016 4 8
104 2016 5 10
100 2017 5 9
101 2017 6 11
102 2017 7 13
104 2017 8 15
100 2018 6 13
106 2018 8 15
Not sure where to go next.
CodePudding user response:
You can use a merge
on the shifted years, if the value is found then we have a 0, then fill the NaNs with 1:
dfs = [df2016, df2017, df2018]
df = pd.concat(dfs)
cols = ['ID', 'year']
df = (df
.merge(df[cols].assign(year=df['year'].add(1), enter=0), on=cols, how='left')
.merge(df[cols].assign(year=df['year'].sub(1), exit=0), on=cols, how='left')
.fillna({'enter': 1, 'exit': 1}, downcast='infer')
)
output:
ID A B year enter exit
0 99 1 2 2016 1 1
1 101 2 4 2016 1 0
2 102 3 6 2016 1 0
3 103 4 8 2016 1 1
4 104 5 10 2016 1 0
5 100 5 9 2017 1 0
6 101 6 11 2017 0 1
7 102 7 13 2017 0 1
8 104 8 15 2017 0 1
9 100 6 13 2018 0 1
10 106 8 15 2018 1 1
CodePudding user response:
You can use:
df = pd.concat([df2016, df2017, df2018], ignore_index=True)
g = df.groupby('ID')['year']
df['entry'] = g.diff(1).ne(1).astype(int)
df['exit'] = g.diff(-1).ne(-1).astype(int)
print(df)
# Output
ID A B year entry exit
0 99 1 2 2016 1 1
1 101 2 4 2016 1 0
2 102 3 6 2016 1 0
3 103 4 8 2016 1 1
4 104 5 10 2016 1 0
5 100 5 9 2017 1 0
6 101 6 11 2017 0 1
7 102 7 13 2017 0 1
8 104 8 15 2017 0 1
9 100 6 13 2018 0 1
10 106 8 15 2018 1 1