Home > Enterprise >  is there an R function for pooling data by rows?
is there an R function for pooling data by rows?

Time:07-22

I'm fairly new to R and I've been struggling with how to make this work:

I'm looking to pool the population numbers across several years (to match a smaller pooled dataset). So instead of having yearly population data, I would combine rows in "three-year" intervals and take an average (eg. (2007 2008 2009)/3). This would be done for all regions (columns).

My current dataframe:

year CAN ONT AL BC atlantic mansask
2007 32887928 12764195 3514031 4290988 2327238 2191414
2008 33245773 12882625 3595755 4349412 2333027 2215120
2009 33628571 12997687 3679092 4410679 2344786 2243371
2010 34005274 13135063 3732573 4465924 2358767 2272355
2011 34342780 13263544 3790191 4499139 2369074 2300077
2012 34750545 13413702 3880755 4546290 2373250 2336283
2013 35152370 13555754 3997950 4590081 2371356 2370413
2014 35535348 13680425 4108416 4646462 2371210 2401551
2015 35832513 13789597 4177527 4694699 2371095 2426572
2016 36264604 13976320 4236376 4757658 2385779 2466703

My ideal transformation (I didn't rework the numbers in this example):

year CAN ONT AL BC atlantic mansask
2007-2009 32887928 12764195 3514031 4290988 2327238 2191414
2010-2012 33245773 12882625 3595755 4349412 2333027 2215120
2013-2015 33628571 12997687 3679092 4410679 2344786 2243371
2016-2018 34005274 13135063 3732573 4465924 2358767 2272355

Thanks in advance for any advice/help!

CodePudding user response:

Some simple dpylr work does the trick. Let me know if this works:

library(dplyr)
library(tibble)

tibble::tribble(
  ~year,      ~CAN,      ~ONT,      ~AL,      ~BC, ~atlantic, ~mansask,
  2007L, 32887928L, 12764195L, 3514031L, 4290988L,  2327238L, 2191414L,
  2008L, 33245773L, 12882625L, 3595755L, 4349412L,  2333027L, 2215120L,
  2009L, 33628571L, 12997687L, 3679092L, 4410679L,  2344786L, 2243371L,
  2010L, 34005274L, 13135063L, 3732573L, 4465924L,  2358767L, 2272355L,
  2011L, 34342780L, 13263544L, 3790191L, 4499139L,  2369074L, 2300077L,
  2012L, 34750545L, 13413702L, 3880755L, 4546290L,  2373250L, 2336283L,
  2013L, 35152370L, 13555754L, 3997950L, 4590081L,  2371356L, 2370413L,
  2014L, 35535348L, 13680425L, 4108416L, 4646462L,  2371210L, 2401551L,
  2015L, 35832513L, 13789597L, 4177527L, 4694699L,  2371095L, 2426572L,
  2016L, 36264604L, 13976320L, 4236376L, 4757658L,  2385779L, 2466703L
  ) %>%
  group_by(row_num = row_number() %/% 3.1) %>%
  summarise(across(!starts_with("year"), mean),
            year = paste(first(year), last(year), sep = "-")) %>%
  select(year, everything(), -row_num)

# # A tibble: 4 × 7
#   year            CAN       ONT       AL       BC atlantic mansask
#   <chr>         <dbl>     <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
# 1 2007-2009 33254091. 12881502. 3596293. 4350360. 2335017  2216635
# 2 2010-2012 34366200. 13270770. 3801173  4503784. 2367030. 2302905
# 3 2013-2015 35506744. 13675259. 4094631  4643747. 2371220. 2399512
# 4 2016-2016 36264604  13976320  4236376  4757658  2385779  2466703
  • Related