Home > other >  R: Cumulative Mean Excluding Current Value?
R: Cumulative Mean Excluding Current Value?

Time:12-09

I am working with the R programming language.

I have a dataset that looks something like this:

id = c(1,1,1,1,2,2,2)
year = c(2010,2011,2012,2013, 2012, 2013, 2014)
var = rnorm(7,7,7)

my_data = data.frame(id, year,var)

 id year       var
1  1 2010 12.186300
2  1 2011 19.069836
3  1 2012  7.456078
4  1 2013 14.875019
5  2 2012 20.827933
6  2 2013  5.029625
7  2 2014 -2.260658

For each "group" within the ID column - at each row, I want to take the CUMULATIVE MEAN of the "var" column but EXCLUDE the value of "var" within that row (i.e. most recent).

As an example:

  • row 1: NA
  • row 2: 12.186300/1
  • row 3: (12.186300 19.069836)/2
  • row 4: (12.186300 19.069836 7.45)/3
  • row 5: NA
  • row 6: 20.827933
  • row 7: (20.827933 5.029625)/2

I found this post here (Cumsum excluding current value) which (I think) shows how to do this for the "cumulative sum" - I tried to apply the logic here to my question:

transform(my_data, cmean = ave(var, id, FUN = cummean) - var)

  id year       var     cmean
1  1 2010 12.186300  0.000000
2  1 2011 19.069836 -3.441768
3  1 2012  7.456078  5.447994
4  1 2013 14.875019 -1.478211
5  2 2012 20.827933  0.000000
6  2 2013  5.029625  7.899154
7  2 2014 -2.260658 10.126291

The code appears to have run - but I don't think I have done this correctly (i.e. the numbers produced don't match up with the numbers I had anticipated).

I then tried an answer provided here (Compute mean excluding current value):

my_data %>%
    group_by(id) %>% 
    mutate(avg = (sum(var) - var)/(n() - 1))

# A tibble: 7 x 4
# Groups:   id [2]
     id  year   var   avg
  <dbl> <dbl> <dbl> <dbl>
1     1  2010 12.2  13.8 
2     1  2011 19.1  11.5 
3     1  2012  7.46 15.4 
4     1  2013 14.9  12.9 
5     2  2012 20.8   1.38
6     2  2013  5.03  9.28

But it is still not working.

Can someone please show me what I am doing wrong and what I can do this fix this problem?

Thanks!

CodePudding user response:

With the help of some intermediate variables you can do it like so:

library(dplyr)

df <- read.table(text = "
                  id year       var
1  1 2010 12.186300
2  1 2011 19.069836
3  1 2012  7.456078
4  1 2013 14.875019
5  2 2012 20.827933
6  2 2013  5.029625
7  2 2014 -2.260658", header=T)

df |> 
  group_by(id) |> 
  #mutate(avg =lag(cummean(var)))
  mutate(id_g = row_number()) |> 
  mutate(ms = cumsum(var)) |> 
  mutate(cm = ms/id_g,
         cm = ifelse(ms == cm, NA, cm)) |> 
  select(-id_g, -ms)
#> # A tibble: 7 × 4
#> # Groups:   id [2]
#>      id  year   var    cm
#>   <int> <int> <dbl> <dbl>
#> 1     1  2010 12.2  NA   
#> 2     1  2011 19.1  15.6 
#> 3     1  2012  7.46 12.9 
#> 4     1  2013 14.9  13.4 
#> 5     2  2012 20.8  NA   
#> 6     2  2013  5.03 12.9 
#> 7     2  2014 -2.26  7.87

CodePudding user response:

df %>%
   group_by(id)%>%
   mutate(avg = lag(cummean(var)))

# A tibble: 7 × 4
# Groups:   id [2]
     id  year   var   avg
  <int> <int> <dbl> <dbl>
1     1  2010 12.2   NA  
2     1  2011 19.1   12.2
3     1  2012  7.46  15.6
4     1  2013 14.9   12.9
5     2  2012 20.8   NA  
6     2  2013  5.03  20.8
7     2  2014 -2.26  12.9
  • Related