create time interval period for numerical time format in each group in R-CodePudding

I want to create time interval period for each group and my time format is numerical. Lets say I want 1 hour interval from the first record and every record within on hour will be in interval 1 and anything after 1 hour and less than 2 hours from the first record will be interval 2 and so on (for each user group).

technically I am looking for creating bins of one hour from the beginning.

df<-read.table(text="
user     timestart
1        1421286975
1        1421287343
1        1421470513
1        1421470513
1        1421471816
1        1421806839
2        1424217068
2        1424217150
2        1424218395",header=T,stringsAsFactors = F)

# result: (might not 100% accurate but you get the point)
user    timestart    interval_1h
1       1421286975     1
1       1421287343     1
1       1421470513     2
1       1421470513     2
1       1421471816     2
1       1421806839     3
2       1424217068     1
2       1424217150     1
2       1424218395     1

CodePudding user response：

For me, there are two interpretations of this question. Here are solutions for both of them. We use dplyr to get the desired output:

The first interpretation creates an output which is similar to your shown output but contradicts your actual question:

df %>% 
  mutate(time = as.POSIXlt(timestart, origin = "1970-01-01")) %>% 
  group_by(user) %>% 
  mutate(grp = cumsum(coalesce(difftime(time, lag(time), units = "hours") >= 1, TRUE))) %>% 
  group_by(user, grp) %>% 
  mutate(grp2 = difftime(time, first(time), units = "hours") >= 1) %>% 
  group_by(user) %>% 
  mutate(grp = grp   cumsum(grp2), .keep = "unused") %>% 
  ungroup()

This returns

# A tibble: 10 x 4
    user  timestart time                  grp
   <int>      <int> <dttm>              <int>
 1     1 1421286975 2015-01-15 02:56:15     1
 2     1 1421287343 2015-01-15 03:02:23     1
 3     1 1421470513 2015-01-17 05:55:13     2
 4     1 1421470513 2015-01-17 05:55:13     2
 5     1 1421471816 2015-01-17 06:16:56     2
 6     1 1421475400 2015-01-17 07:16:40     3
 7     1 1421806839 2015-01-21 03:20:39     4
 8     2 1424217068 2015-02-18 00:51:08     1
 9     2 1424217150 2015-02-18 00:52:30     1
10     2 1424218395 2015-02-18 01:13:15     1

The second one takes the first timestart per user and creates 1 hour slots. Each following timestamp is assigned to one of those timeslots and the group is created based on those timeslots.

df %>% 
  group_by(user) %>% 
  mutate(time = as.POSIXlt(timestart, origin = "1970-01-01"),
         helper = (timestart %% first(timestart)) %/% 3600,
         grp = cumsum(helper - lag(helper, default = 0) > 0)   1) %>% 
  ungroup() %>% 
  select(-helper)

This one returns

# A tibble: 10 x 4
    user  timestart time                  grp
   <int>      <int> <dttm>              <dbl>
 1     1 1421286975 2015-01-15 02:56:15     1
 2     1 1421287343 2015-01-15 03:02:23     1
 3     1 1421470513 2015-01-17 05:55:13     2
 4     1 1421470513 2015-01-17 05:55:13     2
 5     1 1421471816 2015-01-17 06:16:56     3
 6     1 1421475400 2015-01-17 07:16:40     4
 7     1 1421806839 2015-01-21 03:20:39     5
 8     2 1424217068 2015-02-18 00:51:08     1
 9     2 1424217150 2015-02-18 00:52:30     1
10     2 1424218395 2015-02-18 01:13:15     1

Data

I added one data point to get better example data

df <- structure(list(user = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L
), timestart = c(1421286975L, 1421287343L, 1421470513L, 1421470513L, 
1421471816L, 1421475400L, 1421806839L, 1424217068L, 1424217150L, 
1424218395L)), class = "data.frame", row.names = c(NA, -10L))

CodePudding user response：

Consider some helper columns with several calls to ave:

output <- within(
  df, {
    timedt <- as.POSIXct(timestart, origin="1970-01-01")
    first <- ave(timedt, user, FUN=min)
    hour_diff <- round(as.numeric(difftime(timedt, first, unit="hours")))

    interval_1h <- ave(
      ifelse(ave(hour_diff, user, hour_diff, FUN=seq_along) == 1, 1, 0),
      user,
      FUN=cumsum
    )
    rm(timedt, first, hour_diff)
  }
)

output
  user  timestart interval_1h
1    1 1421286975           1
2    1 1421287343           1
3    1 1421470513           2
4    1 1421470513           2
5    1 1421471816           2
6    1 1421806839           3
7    2 1424217068           1
8    2 1424217150           1
9    2 1424218395           1