Get a running count of group over time in r-CodePudding

Ok so we have some pretty standard data that looks like this with a date and user id column, but the id can occur multiple times in a day:

id               Date
as7fyaisdf       2017-11-08
p98ashdfp9       2017-11-08
p98ashdfp9       2017-11-08
p98ashdfp9       2017-11-08
2984oinrv7       2017-11-08
as7fyaisdf       2017-11-09
p98ashdfp9       2017-11-09
2984oinrv7       2017-11-09
9asjenorin       2017-11-09

I want to get a running count that shows the cumulative number of times a given id has occurred over time. But I only want it to count a single day once. So it would look like this in this case:

id               Date           running_count
as7fyaisdf       2017-11-08     1
p98ashdfp9       2017-11-08     1
p98ashdfp9       2017-11-08     1
p98ashdfp9       2017-11-08     1
2984oinrv7       2017-11-08     1
as7fyaisdf       2017-11-09     2
p98ashdfp9       2017-11-09     2
2984oinrv7       2017-11-09     2
9asjenorin       2017-11-09     1

I feel like this probably involves the rle() function or data.table's rleid(), but I haven't been able to crack it. I'd like to do this within the tidyverse if possible, but open to other options in the R universe. Would like to keep this in any help is appreciated.

CodePudding user response：

You could group by id and get the row_number:

library(tidyverse) 
df %>%
   left_join(distinct(.) %>%
   group_by(id) %>%
   mutate(running_count = row_number()))

          id       Date running_count
1 as7fyaisdf 2017-11-08             1
2 p98ashdfp9 2017-11-08             1
3 p98ashdfp9 2017-11-08             1
4 p98ashdfp9 2017-11-08             1
5 2984oinrv7 2017-11-08             1
6 as7fyaisdf 2017-11-09             2
7 p98ashdfp9 2017-11-09             2
8 2984oinrv7 2017-11-09             2
9 9asjenorin 2017-11-09             1

CodePudding user response：

Since you asked about data.table:

using rleid

dt[order(id,Date),running_count:=rleid(Date),by=id][]

or joining with unique, like the tidyverse solution, which used distinct()

dt[unique(dt)[,running_count:=1:.N, by=.(id)], on=.(id, Date)]

Both option result in this output:

           id       Date running_count
1: as7fyaisdf 2017-11-08             1
2: p98ashdfp9 2017-11-08             1
3: p98ashdfp9 2017-11-08             1
4: p98ashdfp9 2017-11-08             1
5: 2984oinrv7 2017-11-08             1
6: as7fyaisdf 2017-11-09             2
7: p98ashdfp9 2017-11-09             2
8: 2984oinrv7 2017-11-09             2
9: 9asjenorin 2017-11-09             1