Calculating interval of time in sessions of users in R

I'm really new to RStudio and I found some code that would help me analyze a dataset. Could anybody explain this line in a bit more detail: replace_na(interval(lag(Date), Date)/dminutes(1), 0), and also the meaning of .add = T, so that I can implement it in my code?

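library(tidyverse)  # tibble(), dplyr verbs, tidyr::replace_na()
library(lubridate)  # interval(), dminutes()

session_time_limit <- 30  # minutes of inactivity after which a new session starts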
df <- tibble(ID = c("a", "a", "a", "a", "b", "b", "c", "c", "c"),
             Date = c(as.POSIXct("2021-01-25 19:17:12 UTC"), #a1
                      as.POSIXct("2021-01-25 19:17:30 UTC"), #a2
                      as.POSIXct("2021-01-25 19:57:12 UTC"), #a3
                      as.POSIXct("2021-01-25 19:59:12 UTC"), #a4
                      as.POSIXct("2021-01-25 20:11:12 UTC"), #b1
                      as.POSIXct("2021-01-25 20:42:12 UTC"), #b2
                      as.POSIXct("2021-01-25 21:15:42 UTC"), #c1
                      as.POSIXct("2021-01-25 21:17:12 UTC"), #c2
                      as.POSIXct("2021-01-25 21:20:13 UTC"))) #c3

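df %>%
  group_by(ID) %>%
  # minutes since the previous row of the same ID (0 for each ID's first row)
  mutate(tdiff = replace_na(interval(lag(Date), Date)/dminutes(1), 0),
         # every gap longer than the limit starts a new session
         session_number = cumsum(tdiff > session_time_limit) + 1) %>%
  group_by(session_number, .add = T) %>%
  mutate(activity_within_session = row_number()) %>%
  ungroup()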

CodePudding user response：

This may be something you already know, but a great way to get information on a function is to do ?nameoffunction. So in this case ?interval, ?lag could give you a start. That being said, I know these help sections often confuse me so an extra explanation as you requested makes sense.

interval creates an interval as you may expect, so it's a way of specifying two times. For example, using interval(as.POSIXct("2021-01-25 19:17:12 UTC"), as.POSIXct("2021-01-25 19:17:30 UTC")), the first two dates you have, will create the interval 2021-01-25 19:17:12 PST--2021-01-25 19:17:30 PST. Personally I use intervals to see if a date falls within a certain timeframe. In your code, we also have lag, a function which will take the data from the previous row. Thus, if we were on row two, lag(Date) would provide the date from row 1, and Date would provide the date from row 2, and interval(lag(Date), Date) would create the interval I mentioned above.
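To make the lag/interval pairing concrete, here is a minimal sketch using three of the timestamps from the question. Setting tz = "UTC" explicitly and the variable names dates, prev and spans are my own assumptions, not part of the original code.

```r
library(lubridate)  # interval(), int_length()
library(dplyr)      # lag()

# Three of the question's timestamps; tz = "UTC" is set explicitly (an assumption)
dates <- as.POSIXct(c("2021-01-25 19:17:12",
                      "2021-01-25 19:17:30",
                      "2021-01-25 19:57:12"), tz = "UTC")

prev  <- dplyr::lag(dates)      # shifted down by one; the first element is NA
spans <- interval(prev, dates)  # one interval per row: previous -> current

is.na(prev)        # TRUE FALSE FALSE
int_length(spans)  # NA 18 2382 (length of each interval in seconds)
```

The first interval is NA because there is nothing before the first timestamp.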

Then the next part is dminutes(1). This produces a duration object of one minute, or 60 seconds. In this code, we are taking the interval, say between row one and two of 18 seconds, and dividing that by one minute. This produces a 0.3 for the tdiff, meaning 0.3 minutes.
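The division can be tried on its own. This is only a small sketch; start, end, gap_minutes and gap_seconds are names I made up for illustration, and the time zone is set to UTC by assumption.

```r
library(lubridate)

start <- as.POSIXct("2021-01-25 19:17:12", tz = "UTC")
end   <- as.POSIXct("2021-01-25 19:17:30", tz = "UTC")

one_minute  <- dminutes(1)                         # a duration of 60 seconds
gap_minutes <- interval(start, end) / one_minute   # 18 s / 60 s = 0.3
gap_seconds <- interval(start, end) / dseconds(1)  # same gap in seconds: 18
```

Swapping dminutes(1) for dseconds(1) or dhours(1) changes only the unit of the result, not the interval itself.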

But what if we were on row one? There would be no previous row. In that case the function would see interval(NA, as.POSIXct("2021-01-25 19:17:12 UTC")), which would make the interval NA--NA. Dividing that by one minute will produce NA. This is where replace_na comes in handy: anything that is NA is switched to the specified value, here 0.
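The NA replacement is easy to check in isolation. In this sketch, gaps is a hand-written vector standing in for the result of the division (NA marks "no previous row"); it is not produced by the original code.

```r
library(tidyr)  # replace_na()

gaps   <- c(NA, 0.3, 39.7, 2)  # minutes; NA stands for "no previous row"
filled <- replace_na(gaps, 0)  # only the NA entry changes

filled  # 0.0 0.3 39.7 2.0
```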

Bringing this together, every row of tdiff gives you the number of minutes between that row and the previous row of the same group, or 0 if it is the first row of its group (or if both rows have the same timestamp).
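As a rough check of that summary, the tdiff step can be run on the question's data by itself. The sketch below assumes the timestamps are in UTC and stores the result in gaps, a name chosen only for this example.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Same IDs and timestamps as in the question; tz = "UTC" is an assumption
df <- tibble(ID = c("a", "a", "a", "a", "b", "b", "c", "c", "c"),
             Date = as.POSIXct(c("2021-01-25 19:17:12", "2021-01-25 19:17:30",
                                 "2021-01-25 19:57:12", "2021-01-25 19:59:12",
                                 "2021-01-25 20:11:12", "2021-01-25 20:42:12",
                                 "2021-01-25 21:15:42", "2021-01-25 21:17:12",
                                 "2021-01-25 21:20:13"), tz = "UTC"))

gaps <- df %>%
  group_by(ID) %>%
  # minutes since the previous row of the same ID; 0 on each ID's first row
  mutate(tdiff = replace_na(interval(lag(Date), Date) / dminutes(1), 0)) %>%
  ungroup()

round(gaps$tdiff, 2)  # 0.00 0.30 39.70 2.00 0.00 31.00 0.00 1.50 3.02
```

Each ID starts again at 0, because lag() has no previous row to look at inside that group.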

As for the .add = T: .add is the name this argument has in dplyr 1.0.0 and later; in older versions it was called add, and that is what the help page I have describes: "When add = FALSE, the default, group_by() will override existing groups. To add to the existing groups, use add = TRUE." This means that with .add = T, grouping by session_number on top of the existing ID grouping keeps both, so the data is grouped by ID and session_number. With .add = F, the ID grouping would be dropped and only the session_number grouping would exist. Note that on a dplyr version older than 1.0.0, .add = T would not be recognised as this argument and would instead create an extra grouping column titled .add with every value TRUE.

This is how I understand these functions. Hopefully that helps or points you in the right direction. Best of luck!

CodePudding user response：

Welcome to SO! It is fairly straightforward. One important thing first: this code uses functions from the lubridate (date operations) and tidyverse (data frame processing) packages, so you have to install them (if you haven't already) and load them:

library(lubridate)
library(tidyverse)

1. What does tdiff = replace_na(interval(lag(Date), Date)/dminutes(1), 0) do?

replace_na(data, replace, ...) simply replaces the NA values in data with the value given as replace. So let's have a look at its inputs:

-> If we call interval(lag(Date), Date)/dminutes(1) on the whole, ungrouped Date column (that is, on df$Date), this is the output:

[1]        NA  0.300000 39.700000  2.000000 12.000000 31.000000 33.500000  1.500000
[9]  3.016667

So replace_na() changes the NA value to 0:

[1]  0.000000  0.300000 39.700000  2.000000 12.000000 31.000000 33.500000  1.500000
[9]  3.016667

The lag() function is useful when an operation on one row depends on another row: it shifts the Date column down by one position, so each row sees the value from the previous row. If you try this:

interval(df$Date[1],df$Date[2])/dminutes(1)

You will see that the output is 0.3 (the same as the second row of the mutated column tdiff). Why do we need replace_na()? Because for the first row lag() looks at a previous row that doesn't exist, so you get an NA value that you don't want, and you replace it with 0. Since the original code calls group_by(ID) first, this happens on the first row of every ID, not only on the first row of the data frame.
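The effect of grouping on lag() can be seen by counting the NA values it produces. This is just an illustrative sketch on a made-up data frame (toy, with a hypothetical step column), not the data from the question.

```r
library(dplyr)

# Hypothetical toy data: three IDs, five rows
toy <- tibble(ID = c("a", "a", "b", "b", "c"), step = 1:5)

# Without grouping, only the very first row has no predecessor
ungrouped_na <- sum(is.na(dplyr::lag(toy$step)))  # 1

# With group_by(ID), the first row of every ID has no predecessor
grouped_na <- toy %>%
  group_by(ID) %>%
  mutate(prev = lag(step)) %>%
  pull(prev) %>%
  is.na() %>%
  sum()  # 3
```

So in grouped code replace_na() fills one value per group, which is why every ID gets its own 0.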

2. What does group_by(session_number, .add = T) do?

Here you are using .add = T because you already grouped the data frame by ID when you called %>% group_by(ID). If you don't want to lose that grouping, you set .add = T (that is, TRUE) to tell group_by() to add the new grouping variable to the existing ones. .add defaults to FALSE, so if you leave it out, group_by() discards the previous grouping and groups by session_number only.
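One way to see the difference is to inspect the grouping variables with group_vars(). The sketch below assumes dplyr 1.0.0 or later (where the argument is called .add) and uses a small made-up data frame; toy, by_id, with_add and without_add are illustrative names only.

```r
library(dplyr)

# Hypothetical toy data with the two grouping columns already present
toy   <- tibble(ID = c("a", "a", "b"), session_number = c(1, 2, 1))
by_id <- group_by(toy, ID)

# .add = TRUE keeps the existing ID grouping and adds session_number
with_add <- group_vars(group_by(by_id, session_number, .add = TRUE))

# The default (.add = FALSE) replaces the existing grouping
without_add <- group_vars(group_by(by_id, session_number))

with_add     # "ID" "session_number"
without_add  # "session_number"
```

With both grouping variables in place, row_number() in the original code restarts for every session of every ID.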