Using groupby and n_distinct to count NEW unique ids

Time:08-24

The following code counts the number of unique IDs per year. My question is: how can I count the number of new unique IDs per year, i.e., IDs that did not appear in any previous year?

df %>%
  group_by(Year) %>%
  summarize(count = n_distinct(ID))

For example, I need to create the variable wanted_count shown below:

Year ID count wanted_count
2000 1 3 3
2000 2 3 3
2000 3 3 3
2001 2 2 0
2001 3 2 0
2002 3 2 1
2002 4 2 1
2003 4 2 1
2003 7 2 1
2003 4 2 1

See data below:

df <- structure(list(
  Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L),
  ID = c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L),
  count = c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  wanted_count = c(3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 1L, 1L)
), class = "data.frame", row.names = c(NA, -10L))

CodePudding user response:

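library(dplyr)

# Assumes df is sorted by Year, since duplicated() works in row order
df %>%
  # running count of IDs seen for the first time
  mutate(cum_new = cumsum(!duplicated(ID))) %>%
  group_by(Year) %>%
  # cumulative number of distinct IDs up to the end of each year
  summarize(total = max(cum_new), .groups = "drop") %>%
  mutate(
    # year-over-year increase = number of new IDs in that year
    result = c(first(total), diff(total)),
    total = NULL
  ) %>%
  # attach the per-year result back to the original rows
  left_join(df, by = "Year")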
# # A tibble: 10 × 5
#     Year result    ID count wanted_count
#    <int>  <int> <int> <int>        <int>
#  1  2000      3     1     3            3
#  2  2000      3     2     3            3
#  3  2000      3     3     3            3
#  4  2001      0     2     2            0
#  5  2001      0     3     2            0
#  6  2002      1     3     2            1
#  7  2002      1     4     2            1
#  8  2003      1     4     2            1
#  9  2003      1     7     2            1
# 10  2003      1     4     2            1

Using this data:

df = read.table(text = 'Year    ID  count   wanted_count
2000    1   3   3
2000    2   3   3
2000    3   3   3
2001    2   2   0
2001    3   2   0
2002    3   2   1
2002    4   2   1
2003    4   2   1
2003    7   2   1
2003    4   2   1', header = TRUE)
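
To see why differencing the per-year maxima yields the number of new IDs, it may help to look at the intermediate values. The following is a minimal base R sketch on the example data, not part of the answer above; the names is_new, cum_new, total and new_per_year are chosen here for illustration, and the rows are assumed to be sorted by Year.

```r
Year <- c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L)
ID   <- c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L)

# TRUE on the row where an ID occurs for the first time
is_new <- !duplicated(ID)

# running count of distinct IDs seen so far
cum_new <- cumsum(is_new)
# 1 2 3 3 3 3 4 4 5 5

# distinct IDs seen up to the end of each year
total <- as.vector(tapply(cum_new, Year, max))
# 3 3 4 5

# first year keeps its total; later years keep only the increase
new_per_year <- c(total[1], diff(total))
# 3 0 1 1
```

The running count never decreases, so its maximum within a year is the number of distinct IDs seen up to the end of that year, and the difference between consecutive years is the number of IDs that appeared for the first time.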

CodePudding user response:

You could create a logical column that flags the first occurrence of each ID, then sum it within each Year group.

library(dplyr)

df %>%
  mutate(new = !duplicated(ID)) %>%   # TRUE the first time an ID appears
  add_count(Year, wt = new) %>%       # n = number of new IDs in each Year
  select(-new)

#    Year ID n
# 1  2000  1 3
# 2  2000  2 3
# 3  2000  3 3
# 4  2001  2 0
# 5  2001  3 0
# 6  2002  3 1
# 7  2002  4 1
# 8  2003  4 1
# 9  2003  7 1
# 10 2003  4 1
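
Both answers rely on duplicated(ID), which depends on row order, so they assume the data is already sorted by Year. If that cannot be guaranteed, one option is to call arrange(Year) first. Another is the sketch below, which is an alternative to the answers above rather than a restatement of them: it compares each row's Year with the first year in which its ID appears. The names first_year, new_ids, shuffled and result are introduced here for illustration.

```r
library(dplyr)

df <- data.frame(
  Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L),
  ID   = c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L)
)

# Reverse the rows on purpose to show the result does not depend on order
shuffled <- df[rev(seq_len(nrow(df))), ]

result <- shuffled %>%
  group_by(ID) %>%
  mutate(first_year = min(Year)) %>%   # year in which each ID first appears
  group_by(Year) %>%
  # count distinct IDs whose first appearance is the current year
  mutate(new_ids = n_distinct(ID[Year == first_year])) %>%
  ungroup() %>%
  select(-first_year) %>%
  arrange(Year)
```

On the example data, new_ids should match wanted_count from the question: 3 for 2000, 0 for 2001, and 1 for both 2002 and 2003.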