I'm analyzing a big database (snippet of it after description). To give some background, I'm analyzing company performance measured with patents and patent citations with respect to reelected CEOs.
In order to understand the impact that CEOs have had on their company I have now calculated the sum of the previous 3 entries for the column productivity (here called data
) such that:
productivity this year is equal to the sum of productivity this year last year the year before that.
The formula that I'm looking for now takes the productivity of this year calculated as above and goes back in time of 3 years. So, in short, I want to subtract year 1 entry to year 3 (this year) entry for all of the entries. Below there's an example of what I want to do in excel. In the example I'm going back by 4 years!
Now, I need to do this by company and CEO ID. As you can see, not all years have an entry so I can't always go back in time. In those cases I'd like for the entry to be NA
Here's a simplified database
ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4) #CEO ID
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c') #Company
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,44,68,100,5,3,7,6,9,3,5,6,7,44,33,NA)
df1 <- data.frame(ID,C,fyear, data)
ID C fyear data
1 a 2000 NA
1 a 2001 50
1 a 2002 22
1 a 2003 44
1 a 2004 68
1 a 2005 100
3 b 2000 NA
3 b 2001 3
3 b 2002 7
5 b 2003 6
5 b 2004 9
4 c 2000 3
4 c 2001 5
4 c 2002 6
4 c 2003 5
4 c 2004 44
4 c 2005 33
4 c 2006 NA
Here's the desired output. As you can tell, for some CEOs the entries are going to be all NAs
ID C fyear data diff.
1 a 2000 NA NA
1 a 2001 50 NA
1 a 2002 22 NA
1 a 2003 44 -6 # 44-50
1 a 2004 68 46 # 68-22
1 a 2005 11 -33 # 11-44
3 b 2000 NA NA
3 b 2001 3 NA
3 b 2002 7 NA
5 b 2003 6 NA # here the company is the same but I'm analyzing a different CEO (ID)
5 b 2004 9 NA
4 c 2000 3 NA
4 c 2001 5 NA
4 c 2002 6 3
4 c 2003 5 0
4 c 2004 44 38
4 c 2005 33 28
4 c 2006 NA NA
I tried to solve this using a few formulas found here but some of those were not working (I was getting all 0) and still, those were not helping with such a specific problem.
Thanks a lot, I hope somebody can shine some ligth on this.
CodePudding user response:
Is this what you want?
ID <- c(1, 1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 4, 4, 4, 4, 4, 4, 4) # CEO ID
C <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "c", "c") # Company
fyear <- c(2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2005, 2006)
data <- c(NA, 50, 22, 44, 68, 11, NA, 3, 7, 6, 9, 3, 5, 6, 5, 44, 33, NA)
df1 <- data.frame(ID, C, fyear, data)
library(dplyr)
df1 |>
group_by(ID, C) |>
mutate(diff = data - dplyr::lag(data, 2))
Output:
# A tibble: 18 × 5
# Groups: ID, C [4]
ID C fyear data diff
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 NA NA
2 1 a 2001 50 NA
3 1 a 2002 22 NA
4 1 a 2003 44 -6
5 1 a 2004 68 46
6 1 a 2005 11 -33
7 3 b 2000 NA NA
8 3 b 2001 3 NA
9 3 b 2002 7 NA
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 3 NA
13 4 c 2001 5 NA
14 4 c 2002 6 3
15 4 c 2003 5 0
16 4 c 2004 44 38
17 4 c 2005 33 28
18 4 c 2006 NA NA