calculate 5 year average of panel data with factors kept-CodePudding

I have a panel data set that may look like

set.seed(123)
df <- data.frame(
  year = rep(2011:2020,5),
  county = rep(c("a","b",'c','d','e'), each=10),
  state = rep(c("A","B",'C','D','E'), each=10),
  country = rep(c("AA","BB",'CC','DD','EE'), each=10),
  var1 = runif(50, 0, 50),
  var2 = runif(50, 50, 100)
)

I want to transform the panel data set to 5 year averages of the counties by

df <- df %>% 
  mutate(period = cut(df$year, seq(2011, 2021, by = 5),right = F)) %>% 
  group_by(county, period) %>% 
  summarise_all(mean)

The data set looks like

   county period       year state country  var1  var2
   <chr>  <fct>       <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 a      [2011,2016)  2013    NA      NA  33.1  69.7
 2 a      [2016,2021)  2018    NA      NA  24.7  73.6
 3 b      [2011,2016)  2013    NA      NA  27.6  72.3
 4 b      [2016,2021)  2018    NA      NA  24.7  83.1
 5 c      [2011,2016)  2013    NA      NA  38.7  75.7
 6 c      [2016,2021)  2018    NA      NA  22.8  66.8
 7 d      [2011,2016)  2013    NA      NA  33.8  72.2
 8 d      [2016,2021)  2018    NA      NA  20.0  83.7
 9 e      [2011,2016)  2013    NA      NA  14.9  71.0
10 e      [2016,2021)  2018    NA      NA  19.6  70.4

The warming messages are, for example

In mean.default(state) :
  argument is not numeric or logical: returning NA

Is there a smart way (not by merging as actually, I have a lot of character columns) to keep the time-invariant character of each county after the transformation? What I desire is

   county period       year state country  var1  var2
   <chr>  <fct>       <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 a      [2011,2016)  2013    A      AA  33.1  69.7
 2 a      [2016,2021)  2018    A      AA  24.7  73.6
 3 b      [2011,2016)  2013    B      BB  27.6  72.3
 4 b      [2016,2021)  2018    B      BB  24.7  83.1
 5 c      [2011,2016)  2013    C      CC  38.7  75.7
 6 c      [2016,2021)  2018    C      CC  22.8  66.8
 7 d      [2011,2016)  2013    D      DD  33.8  72.2
 8 d      [2016,2021)  2018    D      DD  20.0  83.7
 9 e      [2011,2016)  2013    E      EE  14.9  71.0
10 e      [2016,2021)  2018    E      EE  19.6  70.4

Thank you in advance!

CodePudding user response：

The warnning results from that summarise_all(mean) calculates averages not only on var1 & var2 but on state & country. If you want to keep state and country as grouping columns, you should put them into group_by():

library(dplyr)

df %>%
  group_by(county, state, country,
           period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
  summarise_all(mean) %>%
  ungroup()

# # A tibble: 10 × 7
#    county state country period       year  var1  var2
#    <chr>  <chr> <chr>   <fct>       <dbl> <dbl> <dbl>
#  1 a      A     AA      [2011,2016)  2013  33.1  69.7
#  2 a      A     AA      [2016,2021)  2018  24.7  73.6
#  3 b      B     BB      [2011,2016)  2013  27.6  72.3
#  4 b      B     BB      [2016,2021)  2018  24.7  83.1
#  5 c      C     CC      [2011,2016)  2013  38.7  75.7
#  6 c      C     CC      [2016,2021)  2018  22.8  66.8
#  7 d      D     DD      [2011,2016)  2013  33.8  72.2
#  8 d      D     DD      [2016,2021)  2018  20.0  83.7
#  9 e      E     EE      [2011,2016)  2013  14.9  71.0
# 10 e      E     EE      [2016,2021)  2018  19.6  70.4

CodePudding user response：

Consider group_by without summarise to aggregate columns but keep original rows. Given you may have many var columns, use mutate_at to run mean across many columns.

df_county_period_means <- df %>% 
  mutate(
    period = cut(
      year, seq(2011, 2021, by = 5), right = FALSE)
  ) %>% 
  group_by(county, period) %>% 
  mutate_at(
    .vars = vars(var1:var2),
    .funs = list(mean = ~mean(.))
  ) 

df_county_period_means
# A tibble: 50 x 9
# Groups:   county, period [10]
    year county state country  var1  var2 period      var1_mean var2_mean
   <int> <chr>  <chr> <chr>   <dbl> <dbl> <fct>           <dbl>     <dbl>
 1  2011 a      A     AA      14.4   52.3 [2011,2016)      33.1      69.7
 2  2012 a      A     AA      39.4   72.1 [2011,2016)      33.1      69.7
 3  2013 a      A     AA      20.4   89.9 [2011,2016)      33.1      69.7
 4  2014 a      A     AA      44.2   56.1 [2011,2016)      33.1      69.7
 5  2015 a      A     AA      47.0   78.0 [2011,2016)      33.1      69.7
 6  2016 a      A     AA       2.28  60.3 [2016,2021)      24.7      73.6
 7  2017 a      A     AA      26.4   56.4 [2016,2021)      24.7      73.6
 8  2018 a      A     AA      44.6   87.7 [2016,2021)      24.7      73.6
 9  2019 a      A     AA      27.6   94.8 [2016,2021)      24.7      73.6
10  2020 a      A     AA      22.8   68.7 [2016,2021)      24.7      73.6