Ok so we have some pretty standard data that looks like this with a date and user id column, but the id can occur multiple times in a day:
id Date
as7fyaisdf 2017-11-08
p98ashdfp9 2017-11-08
p98ashdfp9 2017-11-08
p98ashdfp9 2017-11-08
2984oinrv7 2017-11-08
as7fyaisdf 2017-11-09
p98ashdfp9 2017-11-09
2984oinrv7 2017-11-09
9asjenorin 2017-11-09
I want to get a running count that shows the cumulative number of times a given id has occurred over time. But I only want it to count a single day once. So it would look like this in this case:
id Date running_count
as7fyaisdf 2017-11-08 1
p98ashdfp9 2017-11-08 1
p98ashdfp9 2017-11-08 1
p98ashdfp9 2017-11-08 1
2984oinrv7 2017-11-08 1
as7fyaisdf 2017-11-09 2
p98ashdfp9 2017-11-09 2
2984oinrv7 2017-11-09 2
9asjenorin 2017-11-09 1
I feel like this probably involves the rle() function or data.table's rleid(), but I haven't been able to crack it. I'd like to do this within the tidyverse if possible, but open to other options in the R universe. Would like to keep this in any help is appreciated.
CodePudding user response:
You could group by id
and get the row_number
:
library(tidyverse)
df %>%
left_join(distinct(.) %>%
group_by(id) %>%
mutate(running_count = row_number()))
id Date running_count
1 as7fyaisdf 2017-11-08 1
2 p98ashdfp9 2017-11-08 1
3 p98ashdfp9 2017-11-08 1
4 p98ashdfp9 2017-11-08 1
5 2984oinrv7 2017-11-08 1
6 as7fyaisdf 2017-11-09 2
7 p98ashdfp9 2017-11-09 2
8 2984oinrv7 2017-11-09 2
9 9asjenorin 2017-11-09 1
CodePudding user response:
Since you asked about data.table
:
using rleid
dt[order(id,Date),running_count:=rleid(Date),by=id][]
or joining with unique
, like the tidyverse solution, which used distinct()
dt[unique(dt)[,running_count:=1:.N, by=.(id)], on=.(id, Date)]
Both option result in this output:
id Date running_count
1: as7fyaisdf 2017-11-08 1
2: p98ashdfp9 2017-11-08 1
3: p98ashdfp9 2017-11-08 1
4: p98ashdfp9 2017-11-08 1
5: 2984oinrv7 2017-11-08 1
6: as7fyaisdf 2017-11-09 2
7: p98ashdfp9 2017-11-09 2
8: 2984oinrv7 2017-11-09 2
9: 9asjenorin 2017-11-09 1