New to r and I'm having difficulty getting the counts I'm after. I have a dataset that contains several columns of various counts per year. Here is an example:
huc_code_eight | year | count_1 | count_2 |
---|---|---|---|
6010105 | 1946 | 4 | 4 |
6010105 | 1947 | 6 | 0 |
6010105 | 1948 | 2 | 0 |
6010105 | 1957 | 4 | 4 |
6020001 | 1957 | 2 | 0 |
8010203 | 1957 | 0 | 0 |
I want to aggregate these counts based upon consecutive years, grouped by huc_code_eight
. The expected output would look like:
huc_code_eight | year | count_1 | count_2 |
---|---|---|---|
6010105 | 1946 - 1948 | 12 | 4 |
6010105 | 1957 | 4 | 4 |
6020001 | 1957 | 2 | 0 |
8010203 | 1957 | 0 | 0 |
I would like to avoid iterating through the data and summing these manually, but, though I've found many examples of aggregating in r, I've been unable to successfully refactor them to fit my use case.
Any help would be greatly appreciated!
CodePudding user response:
We can create a grouping column based on diff
erence of adjacent elements in 'year' along with 'huc_code_eight' and then summarise
library(dplyr)
library(stringr)
df1 %>%
group_by(huc_code_eight) %>%
mutate(year_grp = cumsum(c(TRUE, diff(year) != 1))) %>%
group_by(year_grp, .add = TRUE) %>%
summarise(year = if(n() > 1)
str_c(range(year), collapse = ' - ') else as.character(year),
across(starts_with('count'), sum, na.rm = TRUE), .groups = 'drop') %>%
dplyr::select(-year_grp)
-output
# A tibble: 4 × 4
huc_code_eight year count_1 count_2
<int> <chr> <int> <int>
1 6010105 1946 - 1948 12 4
2 6010105 1957 4 4
3 6020001 1957 2 0
4 8010203 1957 0 0
data
df1 <- structure(list(huc_code_eight = c(6010105L, 6010105L, 6010105L,
6010105L, 6020001L, 8010203L), year = c(1946L, 1947L, 1948L,
1957L, 1957L, 1957L), count_1 = c(4L, 6L, 2L, 4L, 2L, 0L), count_2 = c(4L,
0L, 0L, 4L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
CodePudding user response:
Here is a data.table approach
- set as data.table,, get the subsequent year, set to 1 if NA, and create run-length id
dat <- setDT(dat)[, yr:= year-shift(year),by=huc_code_eight][is.na(yr), yr:=1][,grp:=rleid(huc_code_eight,yr)]
- create the character year (range if necessary, and sum of counts, by id
dat[,.(
year = fifelse(.N>1,paste0(min(year),"-",max(year)),paste0(year, collapse="")),
count_1=sum(count_1),count_2=sum(count_2)),
by=.(grp,huc_code_eight)][,grp:=NULL][]
Output:
huc_code_eight year count_1 count_2
1: 6010105 1946-1948 12 4
2: 6010105 1957 4 4
3: 6020001 1957 2 0
4: 8010203 1957 0 0