I have the following example data frame (the original has more than 200,000 rows:
df <- tribble(~ id, ~ birth_year,
1, 2015,
1, 2015,
1, 2016,
2, 2008,
2, 2010,
3, 2010,
3, 2014,
3, 2014,
3, 2014,
4, 2010,
4, 2011,
4, 2012,
4, 2013)
I want to group by id and count how much times a birth year is repeated with the id. Ideally obtaining this:
df_wanted <- tribble(~ id, ~ birth_year, ~ n,
1, 2015, 2,
1, 2015, 2,
1, 2016, 2,
2, 2008, 0,
2, 2010, 0,
3, 2010, 3,
3, 2014, 3,
3, 2014, 3,
3, 2014, 3,
4, 2010, 0,
4, 2011, 0,
4, 2012, 0,
4, 2013, 0)
Any suggestions? I came across conditionally counting specific values but no repeated cases like my example.
CodePudding user response:
When you say, "group by id and count" do you mean aggregating? if so, I believe this would do the trick:
df %>%
count(id, birth_year)
If you mean you want to add a column with the counts of id and birth year (like your df_wanted
shows), the the add_count
function is what you're looking for:
df %>%
add_count(id, birth_year)
You can learn a lot of this from the R for data science book, or the dplyr documentation.
CodePudding user response:
With dplyr
you can try the following. If there are no duplicates, will set n
to zero. If there are, then use table
to get frequencies of birth_year
, and then use mutate
to store the max
number.
library(dplyr)
df %>%
group_by(id) %>%
mutate(n = ifelse(anyDuplicated(birth_year), max(table(birth_year)), 0))
Output
id birth_year n
<dbl> <dbl> <dbl>
1 1 2015 2
2 1 2015 2
3 1 2016 2
4 2 2008 0
5 2 2010 0
6 3 2010 3
7 3 2014 3
8 3 2014 3
9 3 2014 3
10 4 2010 0
11 4 2011 0
12 4 2012 0
13 4 2013 0
An alternative with data.table
which may be faster:
library(data.table)
setDT(df)[, n := ifelse(anyDuplicated(birth_year), max(table(birth_year)), 0), id]
Or base R:
within(df, { n <- ave(birth_year, id, FUN = function(x) ifelse(anyDuplicated(x), max(table(x)), 0)) })