Home > Enterprise >  How to take the mean of two subsequent rows iteratively thereby reducing the number of rows?
How to take the mean of two subsequent rows iteratively thereby reducing the number of rows?

Time:07-13

I have a tibble like so:

library(dplyr)
set.seed(1)
my_tib <- tibble(identifier = rep(letters[1:3], each = 4),
       year = rep(seq(2005, 2020, 5), 3),
       value = rnorm(12, mean = 1000, 100) %>% round()
       )

my_tib

# A tibble: 12 × 3
   identifier  year value
   <chr>      <dbl> <dbl>
 1 a           2005   937
 2 a           2010  1018
 3 a           2015   916
 4 a           2020  1160
 5 b           2005  1033
 6 b           2010   918
 7 b           2015  1049
 8 b           2020  1074
 9 c           2005  1058
10 c           2010   969
11 c           2015  1151
12 c           2020  1039

Now I'd like to shrink down my tibble by taking the mean value for two years each, creating a new column for the year bracket. For example, I'd like to take the mean of 937 and 1018 (977.5) for the new year_bracket 2005-2010.

I'd like to repeat this for all years and all identifiers.

So the first new 5 rows of my tibble look like this:

head(my_new_tib, 5)

# A tibble: 9 × 3
   identifier  year_bracket  value       
   <chr>       <chr>         <dbl>
 1 a           2005-2010     977.5
 2 a           2010-2015     967
 3 a           2015-2020     1038
 4 b           2005-2010     975.5
 5 b           2010-2015     983.5

Ideally, I'm looking for a piped dplyr solution but I'm also curious regarding other solutions.

CodePudding user response:

Using dplyr:

library(dplyr)
my_tib |> 
  group_by(identifier) |> 
  mutate(value = (value   lag(value))/2,
         year_bracket = paste0(lag(year)," - ",year),
         .keep = "unused",
         .before = 2) |> 
  filter(!is.na(value)) |> 
  ungroup()

Output:

# A tibble: 9 x 3
  identifier year_bracket value
  <chr>      <chr>        <dbl>
1 a          2005 - 2010   978.
2 a          2010 - 2015   967 
3 a          2015 - 2020  1038 
4 b          2005 - 2010   976.
5 b          2010 - 2015   984.
6 b          2015 - 2020  1062.
7 c          2005 - 2010  1014.
8 c          2010 - 2015  1060 
9 c          2015 - 2020  1095

CodePudding user response:

Another possible solution:

library(tidyverse)

my_tib %>%
  group_by(identifier) %>% 
  slice(c(1, rep(2:(n()-1), each = 2) , n())) %>%
  group_by(identifier, aux = rep(1:n(), each=2, length.out = n())) %>% 
  summarise(year_bracket = str_c(year, collapse = "_"), value = mean(value), 
    .groups = "drop") %>% select(-aux)

#> # A tibble: 9 × 3
#>   identifier year_bracket value
#>   <chr>      <chr>        <dbl>
#> 1 a          2005_2010     978.
#> 2 a          2010_2015     967 
#> 3 a          2015_2020    1038 
#> 4 b          2005_2010     976.
#> 5 b          2010_2015     984.
#> 6 b          2015_2020    1062.
#> 7 c          2005_2010    1014.
#> 8 c          2010_2015    1060 
#> 9 c          2015_2020    1095
  • Related