My data looks like this:
job id date response
ny45 088 2021-11-19 1
ny21 088 2021-10-12 0
ny22 088 2021-10-12 0
ny23 017 2020-09-16 0
ny23 014 2020-09-16 1
ny90 017 2020-07-15 1
ny67 017 2020-06-02 1
I want to calculate two things:
(1) The number of occurrences for each id
for each date
in the past year. If the same id
has the same date
, I would like this to count at 0 past occurrence (as it appears on the same day). Something like this:
job id date response occurrences
ny45 088 2021-11-19 1 2
ny21 088 2021-10-12 0 0
ny22 088 2021-10-12 0 0
ny23 017 2020-09-16 0 2
ny23 014 2020-09-16 1 0
ny90 017 2020-07-15 1 1
ny67 017 2020-06-02 1 0
(2) The number of responses (response == 1
) within the past year. Again, if the same id
has the same date
, then in this case I would like this to count as 0. I also want to only count previous responses (i.e. not for the current date). Like this:
job id date response occurrences response_count
ny45 088 2021-11-19 1 2 0
ny21 088 2021-10-12 0 0 0
ny22 088 2021-10-12 0 0 0
ny23 017 2020-09-16 0 2 2
ny23 014 2020-09-16 1 0 0
ny90 017 2020-07-15 1 1 1
ny67 017 2020-06-02 1 0 0
Any help would be greatly appreciated! Thank you.
CodePudding user response:
Here is one way to do this -
library(tidyverse)
library(lubridate)
df %>%
mutate(date = as.Date(date)) %>%
group_by(id) %>%
mutate(data = map(date, ~list(
occurrences = sum(between(date, .x - years(1), .x - 1)),
response_count = sum(response[between(date, .x - years(1),.x - 1)])))) %>%
ungroup %>%
unnest_wider(data)
# job id date response occurrences response_count
# <chr> <int> <date> <int> <int> <int>
#1 ny45 88 2021-11-19 1 2 0
#2 ny21 88 2021-10-12 0 0 0
#3 ny22 88 2021-10-12 0 0 0
#4 ny23 17 2020-09-16 0 2 2
#5 ny23 14 2020-09-16 1 0 0
#6 ny90 17 2020-07-15 1 1 1
#7 ny67 17 2020-06-02 1 0 0
data*
df <- structure(list(job = c("ny45", "ny21", "ny22", "ny23", "ny23",
"ny90", "ny67"), id = c(88L, 88L, 88L, 17L, 14L, 17L, 17L),date = c("2021-11-19",
"2021-10-12", "2021-10-12", "2020-09-16", "2020-09-16", "2020-07-15",
"2020-06-02"), response = c(1L, 0L, 0L, 0L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA, -7L))
CodePudding user response:
Another solution
library(lubridate)
df$date=as_datetime(df$date)
cbind(
df,
t(
sapply(1:nrow(df),function(i){
tmp1=time_length(difftime(df$date[i],df$date),"years")
tmp2=df$id==df$id[i] & tmp1>0 & tmp1<1
data.frame(
"occurences"=sum(tmp2),
"response_count"=sum(df$response[tmp2])
)
})
)
)
job id date response occurences response_count
1 ny45 88 2021-11-19 1 2 0
2 ny21 88 2021-10-12 0 0 0
3 ny22 88 2021-10-12 0 0 0
4 ny23 17 2020-09-16 0 2 2
5 ny23 14 2020-09-16 1 0 0
6 ny90 17 2020-07-15 1 1 1
7 ny67 17 2020-06-02 1 0 0