The following code counts the number of unique IDs per year. My question is: how can I count the number of new unique IDs per year, i.e., IDs that did not appear in any previous year?
df %>%
  group_by(Year) %>%
  summarize(count = n_distinct(ID))
For example, I need to create the variable wanted_count below:
Year | ID | count | wanted_count |
---|---|---|---|
2000 | 1 | 3 | 3 |
2000 | 2 | 3 | 3 |
2000 | 3 | 3 | 3 |
2001 | 2 | 2 | 0 |
2001 | 3 | 2 | 0 |
2002 | 3 | 2 | 1 |
2002 | 4 | 2 | 1 |
2003 | 4 | 2 | 1 |
2003 | 7 | 2 | 1 |
2003 | 4 | 2 | 1 |
See data below:
df <- structure(list(Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L,
2002L, 2003L, 2003L, 2003L), ID = c(1L, 2L, 3L, 2L, 3L, 3L, 4L,
4L, 7L, 4L), count = c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), wanted_count = c(3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
CodePudding user response:
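library(dplyr)
df %>%
  # running count of distinct IDs seen so far (assumes rows are sorted by Year)
  mutate(cum_new = cumsum(!duplicated(ID))) %>%
  group_by(Year) %>%
  # distinct IDs seen up to and including each year
  summarize(total = max(cum_new), .groups = "drop") %>%
  mutate(
    # year-over-year increase = IDs first seen in that year
    result = c(first(total), diff(total)),
    total = NULL
  ) %>%
  # attach the per-year result back to the original rows
  left_join(df, by = "Year")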
# # A tibble: 10 × 5
# Year result ID count wanted_count
# <int> <int> <int> <int> <int>
# 1 2000 3 1 3 3
# 2 2000 3 2 3 3
# 3 2000 3 3 3 3
# 4 2001 0 2 2 0
# 5 2001 0 3 2 0
# 6 2002 1 3 2 1
# 7 2002 1 4 2 1
# 8 2003 1 4 2 1
# 9 2003 1 7 2 1
# 10 2003 1 4 2 1
Using this data:
df <- read.table(text = 'Year ID count wanted_count
2000 1 3 3
2000 2 3 3
2000 3 3 3
2001 2 2 0
2001 3 2 0
2002 3 2 1
2002 4 2 1
2003 4 2 1
2003 7 2 1
2003 4 2 1', header = TRUE)
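To see why the cumsum/diff steps give the number of new IDs, it may help to look at the intermediate values. The following is only a base R sketch on the same ten rows; the variable names (is_new, cum_new, total, result) are illustrative, and it assumes the rows are already ordered by Year.

```r
Year <- c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L)
ID   <- c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L)

# TRUE on the first row where each ID appears
is_new <- !duplicated(ID)

# running number of distinct IDs seen so far
cum_new <- cumsum(is_new)
# 1 2 3 3 3 3 4 4 5 5

# distinct IDs seen up to the end of each year
total <- as.vector(tapply(cum_new, Year, max))
# 3 3 4 5

# the first year keeps its total; later years take the increase over the previous year
result <- c(total[1], diff(total))
# 3 0 1 1
```

The last vector holds one value per year (2000 to 2003), which is what the left_join then spreads back over the original rows.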
CodePudding user response:
You could create a logical column that flags the first occurrence of each ID, and then sum that flag within each Year group.
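library(dplyr)
df %>%
  # TRUE on the first row where each ID appears (assumes rows are sorted by Year)
  mutate(new = !duplicated(ID)) %>%
  # per-Year sum of the flag, added as column n
  add_count(Year, wt = new) %>%
  select(-new)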
# Year ID n
# 1 2000 1 3
# 2 2000 2 3
# 3 2000 3 3
# 4 2001 2 0
# 5 2001 3 0
# 6 2002 3 1
# 7 2002 4 1
# 8 2003 4 1
# 9 2003 7 1
# 10 2003 4 1
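Note that duplicated() depends on row order, so this approach presumably relies on the data being sorted by Year. If that cannot be guaranteed, one possible alternative is to compute the first year of each ID explicitly. This is a sketch rather than part of the answers above, and the names is_new and new_ids are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L),
  ID   = c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L)
)

# reversed row order, to show that sorting is not required here
df_rev <- df[nrow(df):1, ]

new_per_year <- df_rev %>%
  group_by(ID) %>%
  # TRUE on rows that fall in the year an ID first appears
  mutate(is_new = Year == min(Year)) %>%
  group_by(Year) %>%
  # distinct IDs whose first year is this year
  summarize(new_ids = n_distinct(ID[is_new]), .groups = "drop")

# attach the per-year count to every original row
out <- left_join(df, new_per_year, by = "Year")
```

Here out$new_ids should match wanted_count from the question, regardless of how the input rows are ordered.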