I have two data frames with the same variables but from different years:
df2016 = pd.DataFrame({"ID": [100,101,102,103], "A": [1,2,3,4], "B": [2,4,6,8], "year": [2016,2016,2016,2016]})
ID A B year
0 100 1 2 2016
1 101 2 4 2016
2 102 3 6 2016
3 103 4 8 2016
df2017 = pd.DataFrame({"ID": [100,101,102,103], "A": [5,6,7,8], "B": [9,11,13,15], "year": [2017,2017,2017,2017]})
ID A B year
0 100 5 9 2017
1 101 6 11 2017
2 102 7 13 2017
3 103 8 15 2017
I want to combine these and then first-difference the variable "A" wrt time to create a new variable "delta_A" that gives the change in "A" between 2017 and 2016, for every "ID". The result should be something like
Desired Output
A B delta_A
ID year
100 2016 1 2 NaN
101 2016 2 4 NaN
102 2016 3 6 NaN
103 2016 4 8 NaN
100 2017 5 9 4
101 2017 6 11 4
102 2017 7 13 4
103 2017 8 15 4
I want to work within a multi_index setting if possible since I gather that that will be useful for other things. Other answers show how to use series.diff() method to first-difference but do not show how to work with multi-index.
Here's what I tried:
df = pd.concat([df2016, df2017])
df
ID A B year
0 100 1 2 2016
1 101 2 4 2016
2 102 3 6 2016
3 103 4 8 2016
0 100 5 9 2017
1 101 6 11 2017
2 102 7 13 2017
3 103 8 15 2017
df.set_index(["ID", "year"], inplace=True)
df
A B
ID year
100 2016 1 2
101 2016 2 4
102 2016 3 6
103 2016 4 8
100 2017 5 9
101 2017 6 11
102 2017 7 13
103 2017 8 15
Actual Output (not desired)
df["delta_A"] = df["A"].diff()
df
A B delta_A
ID year
100 2016 1 2 NaN
101 2016 2 4 1.0
102 2016 3 6 1.0
103 2016 4 8 1.0
100 2017 5 9 1.0
101 2017 6 11 1.0
102 2017 7 13 1.0
103 2017 8 15 1.0
This is first-differencing wrt the ID dimension rather than the year dimension. I want it first-difference wrt to the year dimension as in my desired output above.
Another approach would be to create a lagged version of "A" and then substract it from "A", but I'm not sure how to do that within multi-index.
Thanks!
CodePudding user response:
In your given example, you can simply use groupby
diff
. Note: This approach will only work if dataframe is sorted by year
.
df['delta_A'] = df.groupby(level=0)['A'].diff()
If you like an alternative way to get the diff
by accessing the multiindex values manually, here is another more robust approach:
s1 = df.loc[(slice(None), 2017), 'A']
s2 = df.loc[(slice(None), 2016), 'A']
df['delta_A'] = s1.sub(s2.droplevel(1))
A B delta_A
ID year
100 2016 1 2 NaN
101 2016 2 4 NaN
102 2016 3 6 NaN
103 2016 4 8 NaN
100 2017 5 9 4.0
101 2017 6 11 4.0
102 2017 7 13 4.0
103 2017 8 15 4.0