I'm fairly new to R and I've been struggling with how to make this work:
I'm looking to pool the population numbers across several years (to match a smaller pooled dataset). So instead of having yearly population data, I would combine rows in "three-year" intervals and take an average (eg. (2007 2008 2009)/3). This would be done for all regions (columns).
My current dataframe:
year | CAN | ONT | AL | BC | atlantic | mansask |
---|---|---|---|---|---|---|
2007 | 32887928 | 12764195 | 3514031 | 4290988 | 2327238 | 2191414 |
2008 | 33245773 | 12882625 | 3595755 | 4349412 | 2333027 | 2215120 |
2009 | 33628571 | 12997687 | 3679092 | 4410679 | 2344786 | 2243371 |
2010 | 34005274 | 13135063 | 3732573 | 4465924 | 2358767 | 2272355 |
2011 | 34342780 | 13263544 | 3790191 | 4499139 | 2369074 | 2300077 |
2012 | 34750545 | 13413702 | 3880755 | 4546290 | 2373250 | 2336283 |
2013 | 35152370 | 13555754 | 3997950 | 4590081 | 2371356 | 2370413 |
2014 | 35535348 | 13680425 | 4108416 | 4646462 | 2371210 | 2401551 |
2015 | 35832513 | 13789597 | 4177527 | 4694699 | 2371095 | 2426572 |
2016 | 36264604 | 13976320 | 4236376 | 4757658 | 2385779 | 2466703 |
My ideal transformation (I didn't rework the numbers in this example):
year | CAN | ONT | AL | BC | atlantic | mansask |
---|---|---|---|---|---|---|
2007-2009 | 32887928 | 12764195 | 3514031 | 4290988 | 2327238 | 2191414 |
2010-2012 | 33245773 | 12882625 | 3595755 | 4349412 | 2333027 | 2215120 |
2013-2015 | 33628571 | 12997687 | 3679092 | 4410679 | 2344786 | 2243371 |
2016-2018 | 34005274 | 13135063 | 3732573 | 4465924 | 2358767 | 2272355 |
Thanks in advance for any advice/help!
CodePudding user response:
Some simple dpylr work does the trick. Let me know if this works:
library(dplyr)
library(tibble)
tibble::tribble(
~year, ~CAN, ~ONT, ~AL, ~BC, ~atlantic, ~mansask,
2007L, 32887928L, 12764195L, 3514031L, 4290988L, 2327238L, 2191414L,
2008L, 33245773L, 12882625L, 3595755L, 4349412L, 2333027L, 2215120L,
2009L, 33628571L, 12997687L, 3679092L, 4410679L, 2344786L, 2243371L,
2010L, 34005274L, 13135063L, 3732573L, 4465924L, 2358767L, 2272355L,
2011L, 34342780L, 13263544L, 3790191L, 4499139L, 2369074L, 2300077L,
2012L, 34750545L, 13413702L, 3880755L, 4546290L, 2373250L, 2336283L,
2013L, 35152370L, 13555754L, 3997950L, 4590081L, 2371356L, 2370413L,
2014L, 35535348L, 13680425L, 4108416L, 4646462L, 2371210L, 2401551L,
2015L, 35832513L, 13789597L, 4177527L, 4694699L, 2371095L, 2426572L,
2016L, 36264604L, 13976320L, 4236376L, 4757658L, 2385779L, 2466703L
) %>%
group_by(row_num = row_number() %/% 3.1) %>%
summarise(across(!starts_with("year"), mean),
year = paste(first(year), last(year), sep = "-")) %>%
select(year, everything(), -row_num)
# # A tibble: 4 × 7
# year CAN ONT AL BC atlantic mansask
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2007-2009 33254091. 12881502. 3596293. 4350360. 2335017 2216635
# 2 2010-2012 34366200. 13270770. 3801173 4503784. 2367030. 2302905
# 3 2013-2015 35506744. 13675259. 4094631 4643747. 2371220. 2399512
# 4 2016-2016 36264604 13976320 4236376 4757658 2385779 2466703