Aggregate over consecutive years-CodePudding

New to r and I'm having difficulty getting the counts I'm after. I have a dataset that contains several columns of various counts per year. Here is an example:

huc_code_eight	year	count_1	count_2
6010105	1946	4	4
6010105	1947	6	0
6010105	1948	2	0
6010105	1957	4	4
6020001	1957	2	0
8010203	1957	0	0

I want to aggregate these counts based upon consecutive years, grouped by huc_code_eight. The expected output would look like:

huc_code_eight	year	count_1	count_2
6010105	1946 - 1948	12	4
6010105	1957	4	4
6020001	1957	2	0
8010203	1957	0	0

I would like to avoid iterating through the data and summing these manually, but, though I've found many examples of aggregating in r, I've been unable to successfully refactor them to fit my use case.

Any help would be greatly appreciated!

CodePudding user response：

We can create a grouping column based on difference of adjacent elements in 'year' along with 'huc_code_eight' and then summarise

library(dplyr)
library(stringr)
df1 %>% 
   group_by(huc_code_eight) %>%
   mutate(year_grp = cumsum(c(TRUE, diff(year) != 1))) %>% 
   group_by(year_grp, .add = TRUE) %>% 
   summarise(year = if(n() > 1) 
     str_c(range(year), collapse = ' - ') else as.character(year), 
      across(starts_with('count'), sum, na.rm = TRUE), .groups = 'drop') %>% 
   dplyr::select(-year_grp)

-output

# A tibble: 4 × 4
  huc_code_eight year        count_1 count_2
           <int> <chr>         <int>   <int>
1        6010105 1946 - 1948      12       4
2        6010105 1957              4       4
3        6020001 1957              2       0
4        8010203 1957              0       0

data

df1 <- structure(list(huc_code_eight = c(6010105L, 6010105L, 6010105L, 
6010105L, 6020001L, 8010203L), year = c(1946L, 1947L, 1948L, 
1957L, 1957L, 1957L), count_1 = c(4L, 6L, 2L, 4L, 2L, 0L), count_2 = c(4L, 
0L, 0L, 4L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-6L))

CodePudding user response：

Here is a data.table approach

set as data.table,, get the subsequent year, set to 1 if NA, and create run-length id

dat <- setDT(dat)[, yr:= year-shift(year),by=huc_code_eight][is.na(yr), yr:=1][,grp:=rleid(huc_code_eight,yr)]

create the character year (range if necessary, and sum of counts, by id

dat[,.(
  year = fifelse(.N>1,paste0(min(year),"-",max(year)),paste0(year, collapse="")),
  count_1=sum(count_1),count_2=sum(count_2)),
  by=.(grp,huc_code_eight)][,grp:=NULL][]

Output:

   huc_code_eight      year count_1 count_2
1:        6010105 1946-1948      12       4
2:        6010105      1957       4       4
3:        6020001      1957       2       0
4:        8010203      1957       0       0