Group by unique ids to calculate number of days in R-CodePudding

i want to find how many days a unique id has worked in this df by abstracting the max date minus min date.

structure(list(id = c("f1", "f2", "f2", "f4", "f5", "f2", "f3", 
"f1", "f1", "f3", "f4", "f2", "f2", "f2", "f2"), 
 date = structure(c(18687, 18687, 18687, 18687, 18687, 
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18687, 18687, 
18687), class = "Date")), row.names = c(2L, 4L, 8L, 15L, 17L, 
18L, 21L, 25L, 36L, 37L, 38L, 40L, 42L, 48L, 52L), class = "data.frame")

I have tried this but I get weird numbers

df_total_days_per_id<-df %>%
  group_by(id) %>%
  mutate(xx1 = max(date)-min(date)) %>% #by user find range
  group_by(id) %>%
  summarise(number_of_days = sum(xx1))

CodePudding user response：

Based on your code I can make an estimate of why you are getting "weird numbers".

The function mutate() adds a new column to the existing table. Let's forget about groups for a moment and assume the following dates:

2022-01-01 
2022-01-11 
2022-01-21

You want the answer to be 20, right? 21 minus 1. However with your mutate() function you create this:

2022-01-01   20 days
2022-01-11   20 days
2022-01-21   20 days

and then in the last operation you sum all these 20s to reach 60 days. Obviously this is wrong.

Instead of using mutate() we use summarize() like so:

df %>%
  group_by(id) %>%
  summarize(number_of_days = max(date)-min(date) 1) #added  1 for correction

after this there is no need to add another group_by() or sum()

Edit:

I noticed that you are talking about the number of days worked. So if we take the example above you might instead want to output 3 instead of 20 because there are days inbetween that were skipped. If this is the case you should use a different function in your summarize() altogether.

For this we can use the function n_distinct() which counts the amount of unique values in a group, like so:

df %>%
  group_by(id) %>%
  summarise(number_of_days = n_distinct(date))