Home > Back-end >  Aggregate over consecutive years
Aggregate over consecutive years

Time:02-28

New to r and I'm having difficulty getting the counts I'm after. I have a dataset that contains several columns of various counts per year. Here is an example:

huc_code_eight year count_1 count_2
6010105 1946 4 4
6010105 1947 6 0
6010105 1948 2 0
6010105 1957 4 4
6020001 1957 2 0
8010203 1957 0 0

I want to aggregate these counts based upon consecutive years, grouped by huc_code_eight. The expected output would look like:

huc_code_eight year count_1 count_2
6010105 1946 - 1948 12 4
6010105 1957 4 4
6020001 1957 2 0
8010203 1957 0 0

I would like to avoid iterating through the data and summing these manually, but, though I've found many examples of aggregating in r, I've been unable to successfully refactor them to fit my use case.

Any help would be greatly appreciated!

CodePudding user response:

We can create a grouping column based on difference of adjacent elements in 'year' along with 'huc_code_eight' and then summarise

library(dplyr)
library(stringr)
df1 %>% 
   group_by(huc_code_eight) %>%
   mutate(year_grp = cumsum(c(TRUE, diff(year) != 1))) %>% 
   group_by(year_grp, .add = TRUE) %>% 
   summarise(year = if(n() > 1) 
     str_c(range(year), collapse = ' - ') else as.character(year), 
      across(starts_with('count'), sum, na.rm = TRUE), .groups = 'drop') %>% 
   dplyr::select(-year_grp)

-output

# A tibble: 4 × 4
  huc_code_eight year        count_1 count_2
           <int> <chr>         <int>   <int>
1        6010105 1946 - 1948      12       4
2        6010105 1957              4       4
3        6020001 1957              2       0
4        8010203 1957              0       0

data

df1 <- structure(list(huc_code_eight = c(6010105L, 6010105L, 6010105L, 
6010105L, 6020001L, 8010203L), year = c(1946L, 1947L, 1948L, 
1957L, 1957L, 1957L), count_1 = c(4L, 6L, 2L, 4L, 2L, 0L), count_2 = c(4L, 
0L, 0L, 4L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-6L))

CodePudding user response:

Here is a data.table approach

  1. set as data.table,, get the subsequent year, set to 1 if NA, and create run-length id
dat <- setDT(dat)[, yr:= year-shift(year),by=huc_code_eight][is.na(yr), yr:=1][,grp:=rleid(huc_code_eight,yr)] 
  1. create the character year (range if necessary, and sum of counts, by id
dat[,.(
  year = fifelse(.N>1,paste0(min(year),"-",max(year)),paste0(year, collapse="")),
  count_1=sum(count_1),count_2=sum(count_2)),
  by=.(grp,huc_code_eight)][,grp:=NULL][]

Output:

   huc_code_eight      year count_1 count_2
1:        6010105 1946-1948      12       4
2:        6010105      1957       4       4
3:        6020001      1957       2       0
4:        8010203      1957       0       0
  •  Tags:  
  • r
  • Related