I have a panel data set that may look like
set.seed(123)
df <- data.frame(
year = rep(2011:2020,5),
county = rep(c("a","b",'c','d','e'), each=10),
state = rep(c("A","B",'C','D','E'), each=10),
country = rep(c("AA","BB",'CC','DD','EE'), each=10),
var1 = runif(50, 0, 50),
var2 = runif(50, 50, 100)
)
I want to transform the panel data set to 5 year averages of the counties by
df <- df %>%
mutate(period = cut(df$year, seq(2011, 2021, by = 5),right = F)) %>%
group_by(county, period) %>%
summarise_all(mean)
The data set looks like
county period year state country var1 var2
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a [2011,2016) 2013 NA NA 33.1 69.7
2 a [2016,2021) 2018 NA NA 24.7 73.6
3 b [2011,2016) 2013 NA NA 27.6 72.3
4 b [2016,2021) 2018 NA NA 24.7 83.1
5 c [2011,2016) 2013 NA NA 38.7 75.7
6 c [2016,2021) 2018 NA NA 22.8 66.8
7 d [2011,2016) 2013 NA NA 33.8 72.2
8 d [2016,2021) 2018 NA NA 20.0 83.7
9 e [2011,2016) 2013 NA NA 14.9 71.0
10 e [2016,2021) 2018 NA NA 19.6 70.4
The warming messages are, for example
In mean.default(state) :
argument is not numeric or logical: returning NA
Is there a smart way (not by merging as actually, I have a lot of character columns) to keep the time-invariant character of each county after the transformation? What I desire is
county period year state country var1 var2
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a [2011,2016) 2013 A AA 33.1 69.7
2 a [2016,2021) 2018 A AA 24.7 73.6
3 b [2011,2016) 2013 B BB 27.6 72.3
4 b [2016,2021) 2018 B BB 24.7 83.1
5 c [2011,2016) 2013 C CC 38.7 75.7
6 c [2016,2021) 2018 C CC 22.8 66.8
7 d [2011,2016) 2013 D DD 33.8 72.2
8 d [2016,2021) 2018 D DD 20.0 83.7
9 e [2011,2016) 2013 E EE 14.9 71.0
10 e [2016,2021) 2018 E EE 19.6 70.4
Thank you in advance!
CodePudding user response:
The warnning results from that summarise_all(mean)
calculates averages not only on var1
& var2
but on state
& country
. If you want to keep state
and country
as grouping columns, you should put them into group_by()
:
library(dplyr)
df %>%
group_by(county, state, country,
period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
summarise_all(mean) %>%
ungroup()
# # A tibble: 10 × 7
# county state country period year var1 var2
# <chr> <chr> <chr> <fct> <dbl> <dbl> <dbl>
# 1 a A AA [2011,2016) 2013 33.1 69.7
# 2 a A AA [2016,2021) 2018 24.7 73.6
# 3 b B BB [2011,2016) 2013 27.6 72.3
# 4 b B BB [2016,2021) 2018 24.7 83.1
# 5 c C CC [2011,2016) 2013 38.7 75.7
# 6 c C CC [2016,2021) 2018 22.8 66.8
# 7 d D DD [2011,2016) 2013 33.8 72.2
# 8 d D DD [2016,2021) 2018 20.0 83.7
# 9 e E EE [2011,2016) 2013 14.9 71.0
# 10 e E EE [2016,2021) 2018 19.6 70.4
CodePudding user response:
Consider group_by
without summarise
to aggregate columns but keep original rows. Given you may have many var columns, use mutate_at
to run mean
across many columns.
df_county_period_means <- df %>%
mutate(
period = cut(
year, seq(2011, 2021, by = 5), right = FALSE)
) %>%
group_by(county, period) %>%
mutate_at(
.vars = vars(var1:var2),
.funs = list(mean = ~mean(.))
)
df_county_period_means
# A tibble: 50 x 9
# Groups: county, period [10]
year county state country var1 var2 period var1_mean var2_mean
<int> <chr> <chr> <chr> <dbl> <dbl> <fct> <dbl> <dbl>
1 2011 a A AA 14.4 52.3 [2011,2016) 33.1 69.7
2 2012 a A AA 39.4 72.1 [2011,2016) 33.1 69.7
3 2013 a A AA 20.4 89.9 [2011,2016) 33.1 69.7
4 2014 a A AA 44.2 56.1 [2011,2016) 33.1 69.7
5 2015 a A AA 47.0 78.0 [2011,2016) 33.1 69.7
6 2016 a A AA 2.28 60.3 [2016,2021) 24.7 73.6
7 2017 a A AA 26.4 56.4 [2016,2021) 24.7 73.6
8 2018 a A AA 44.6 87.7 [2016,2021) 24.7 73.6
9 2019 a A AA 27.6 94.8 [2016,2021) 24.7 73.6
10 2020 a A AA 22.8 68.7 [2016,2021) 24.7 73.6