i want to find how many days a unique id has worked in this df by abstracting the max date minus min date.
structure(list(id = c("f1", "f2", "f2", "f4", "f5", "f2", "f3",
"f1", "f1", "f3", "f4", "f2", "f2", "f2", "f2"),
date = structure(c(18687, 18687, 18687, 18687, 18687,
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18687, 18687,
18687), class = "Date")), row.names = c(2L, 4L, 8L, 15L, 17L,
18L, 21L, 25L, 36L, 37L, 38L, 40L, 42L, 48L, 52L), class = "data.frame")
I have tried this but I get weird numbers
df_total_days_per_id<-df %>%
group_by(id) %>%
mutate(xx1 = max(date)-min(date)) %>% #by user find range
group_by(id) %>%
summarise(number_of_days = sum(xx1))
CodePudding user response:
Based on your code I can make an estimate of why you are getting "weird numbers".
The function mutate()
adds a new column to the existing table. Let's forget about groups for a moment and assume the following dates:
2022-01-01
2022-01-11
2022-01-21
You want the answer to be 20, right? 21 minus 1. However with your mutate()
function you create this:
2022-01-01 20 days
2022-01-11 20 days
2022-01-21 20 days
and then in the last operation you sum all these 20s to reach 60 days. Obviously this is wrong.
Instead of using mutate()
we use summarize()
like so:
df %>%
group_by(id) %>%
summarize(number_of_days = max(date)-min(date) 1) #added 1 for correction
after this there is no need to add another group_by()
or sum()
Edit:
I noticed that you are talking about the number of days worked. So if we take the example above you might instead want to output 3 instead of 20 because there are days inbetween that were skipped. If this is the case you should use a different function in your summarize()
altogether.
For this we can use the function n_distinct()
which counts the amount of unique values in a group, like so:
df %>%
group_by(id) %>%
summarise(number_of_days = n_distinct(date))