Home > Back-end >  Subtracting elements with lag by group and ID in R taking into consideration NAs and adding Nas wher
Subtracting elements with lag by group and ID in R taking into consideration NAs and adding Nas wher

Time:11-23

I'm analyzing a big database (snippet of it after description). To give some background, I'm analyzing company performance measured with patents and patent citations with respect to reelected CEOs.

In order to understand the impact that CEOs have had on their company I have now calculated the sum of the previous 3 entries for the column productivity (here called data) such that:

productivity this year is equal to the sum of productivity this year last year the year before that.

The formula that I'm looking for now takes the productivity of this year calculated as above and goes back in time of 3 years. So, in short, I want to subtract year 1 entry to year 3 (this year) entry for all of the entries. Below there's an example of what I want to do in excel. In the example I'm going back by 4 years!

enter image description here

Now, I need to do this by company and CEO ID. As you can see, not all years have an entry so I can't always go back in time. In those cases I'd like for the entry to be NA

Here's a simplified database


ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4) #CEO ID
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c') #Company 
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,44,68,100,5,3,7,6,9,3,5,6,7,44,33,NA)
df1 <- data.frame(ID,C,fyear, data)

ID  C   fyear  data
1   a   2000    NA  
1   a   2001    50  
1   a   2002    22  
1   a   2003    44  
1   a   2004    68  
1   a   2005    100 
3   b   2000    NA  
3   b   2001    3   
3   b   2002    7   
5   b   2003    6   
5   b   2004    9   
4   c   2000    3   
4   c   2001    5   
4   c   2002    6   
4   c   2003    5   
4   c   2004    44  
4   c   2005    33  
4   c   2006    NA

Here's the desired output. As you can tell, for some CEOs the entries are going to be all NAs

ID  C   fyear  data diff.
1   a   2000    NA   NA
1   a   2001    50   NA
1   a   2002    22   NA
1   a   2003    44   -6  # 44-50
1   a   2004    68   46  # 68-22
1   a   2005    11   -33 # 11-44
3   b   2000    NA   NA
3   b   2001    3    NA
3   b   2002    7    NA
5   b   2003    6    NA  # here the company is the same but I'm analyzing a different CEO (ID)
5   b   2004    9    NA
4   c   2000    3    NA
4   c   2001    5    NA
4   c   2002    6    3
4   c   2003    5    0
4   c   2004    44   38
4   c   2005    33   28
4   c   2006    NA   NA

I tried to solve this using a few formulas found here but some of those were not working (I was getting all 0) and still, those were not helping with such a specific problem.

Thanks a lot, I hope somebody can shine some ligth on this.

CodePudding user response:

Is this what you want?

ID <- c(1, 1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 4, 4, 4, 4, 4, 4, 4) # CEO ID
C <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "c", "c") # Company
fyear <- c(2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2005, 2006)
data <- c(NA, 50, 22, 44, 68, 11, NA, 3, 7, 6, 9, 3, 5, 6, 5, 44, 33, NA)
df1 <- data.frame(ID, C, fyear, data)

library(dplyr)
df1 |>
  group_by(ID, C) |>
  mutate(diff = data - dplyr::lag(data, 2))

Output:

# A tibble: 18 × 5
# Groups:   ID, C [4]
      ID C     fyear  data  diff
   <dbl> <chr> <dbl> <dbl> <dbl>
 1     1 a      2000    NA    NA
 2     1 a      2001    50    NA
 3     1 a      2002    22    NA
 4     1 a      2003    44    -6
 5     1 a      2004    68    46
 6     1 a      2005    11   -33
 7     3 b      2000    NA    NA
 8     3 b      2001     3    NA
 9     3 b      2002     7    NA
10     5 b      2003     6    NA
11     5 b      2004     9    NA
12     4 c      2000     3    NA
13     4 c      2001     5    NA
14     4 c      2002     6     3
15     4 c      2003     5     0
16     4 c      2004    44    38
17     4 c      2005    33    28
18     4 c      2006    NA    NA
  • Related