group_by and conditionally mutate count-CodePudding

I have the following example data frame (the original has more than 200,000 rows:

df <- tribble(~ id, ~ birth_year,
                 1,         2015,
                 1,         2015,
                 1,         2016,
                 2,         2008,
                 2,         2010,
                 3,         2010,
                 3,         2014,
                 3,         2014,
                 3,         2014,
                 4,         2010,
                 4,         2011,
                 4,         2012,
                 4,         2013)

I want to group by id and count how much times a birth year is repeated with the id. Ideally obtaining this:

df_wanted <- tribble(~ id, ~ birth_year, ~ n,
                        1,         2015,   2,
                        1,         2015,   2,
                        1,         2016,   2,
                        2,         2008,   0,
                        2,         2010,   0,
                        3,         2010,   3,  
                        3,         2014,   3,
                        3,         2014,   3,
                        3,         2014,   3,
                        4,         2010,   0,
                        4,         2011,   0,
                        4,         2012,   0,
                        4,         2013,   0)

Any suggestions? I came across conditionally counting specific values but no repeated cases like my example.

CodePudding user response：

When you say, "group by id and count" do you mean aggregating? if so, I believe this would do the trick:

df %>% 
  count(id, birth_year)

If you mean you want to add a column with the counts of id and birth year (like your df_wanted shows), the the add_count function is what you're looking for:

df %>% 
  add_count(id, birth_year)

You can learn a lot of this from the R for data science book, or the dplyr documentation.

CodePudding user response：

With dplyr you can try the following. If there are no duplicates, will set n to zero. If there are, then use table to get frequencies of birth_year, and then use mutate to store the max number.

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(n = ifelse(anyDuplicated(birth_year), max(table(birth_year)), 0))

Output

      id birth_year     n
   <dbl>      <dbl> <dbl>
 1     1       2015     2
 2     1       2015     2
 3     1       2016     2
 4     2       2008     0
 5     2       2010     0
 6     3       2010     3
 7     3       2014     3
 8     3       2014     3
 9     3       2014     3
10     4       2010     0
11     4       2011     0
12     4       2012     0
13     4       2013     0

An alternative with data.table which may be faster:

library(data.table)

setDT(df)[, n := ifelse(anyDuplicated(birth_year), max(table(birth_year)), 0), id]

Or base R:

within(df, { n <- ave(birth_year, id, FUN = function(x) ifelse(anyDuplicated(x), max(table(x)), 0)) })