I want to create time interval period for each group and my time format is numerical. Lets say I want 1 hour interval from the first record and every record within on hour will be in interval 1
and anything after 1 hour and less than 2 hours from the first record will be interval 2
and so on (for each user
group).
technically I am looking for creating bins of one hour from the beginning.
df<-read.table(text="
user timestart
1 1421286975
1 1421287343
1 1421470513
1 1421470513
1 1421471816
1 1421806839
2 1424217068
2 1424217150
2 1424218395",header=T,stringsAsFactors = F)
# result: (might not 100% accurate but you get the point)
user timestart interval_1h
1 1421286975 1
1 1421287343 1
1 1421470513 2
1 1421470513 2
1 1421471816 2
1 1421806839 3
2 1424217068 1
2 1424217150 1
2 1424218395 1
CodePudding user response:
For me, there are two interpretations of this question. Here are solutions for both of them. We use dplyr
to get the desired output:
- The first interpretation creates an output which is similar to your shown output but contradicts your actual question:
df %>%
mutate(time = as.POSIXlt(timestart, origin = "1970-01-01")) %>%
group_by(user) %>%
mutate(grp = cumsum(coalesce(difftime(time, lag(time), units = "hours") >= 1, TRUE))) %>%
group_by(user, grp) %>%
mutate(grp2 = difftime(time, first(time), units = "hours") >= 1) %>%
group_by(user) %>%
mutate(grp = grp cumsum(grp2), .keep = "unused") %>%
ungroup()
This returns
# A tibble: 10 x 4
user timestart time grp
<int> <int> <dttm> <int>
1 1 1421286975 2015-01-15 02:56:15 1
2 1 1421287343 2015-01-15 03:02:23 1
3 1 1421470513 2015-01-17 05:55:13 2
4 1 1421470513 2015-01-17 05:55:13 2
5 1 1421471816 2015-01-17 06:16:56 2
6 1 1421475400 2015-01-17 07:16:40 3
7 1 1421806839 2015-01-21 03:20:39 4
8 2 1424217068 2015-02-18 00:51:08 1
9 2 1424217150 2015-02-18 00:52:30 1
10 2 1424218395 2015-02-18 01:13:15 1
- The second one takes the first
timestart
per user and creates 1 hour slots. Each following timestamp is assigned to one of those timeslots and the group is created based on those timeslots.
df %>%
group_by(user) %>%
mutate(time = as.POSIXlt(timestart, origin = "1970-01-01"),
helper = (timestart %% first(timestart)) %/% 3600,
grp = cumsum(helper - lag(helper, default = 0) > 0) 1) %>%
ungroup() %>%
select(-helper)
This one returns
# A tibble: 10 x 4
user timestart time grp
<int> <int> <dttm> <dbl>
1 1 1421286975 2015-01-15 02:56:15 1
2 1 1421287343 2015-01-15 03:02:23 1
3 1 1421470513 2015-01-17 05:55:13 2
4 1 1421470513 2015-01-17 05:55:13 2
5 1 1421471816 2015-01-17 06:16:56 3
6 1 1421475400 2015-01-17 07:16:40 4
7 1 1421806839 2015-01-21 03:20:39 5
8 2 1424217068 2015-02-18 00:51:08 1
9 2 1424217150 2015-02-18 00:52:30 1
10 2 1424218395 2015-02-18 01:13:15 1
Data
I added one data point to get better example data
df <- structure(list(user = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L
), timestart = c(1421286975L, 1421287343L, 1421470513L, 1421470513L,
1421471816L, 1421475400L, 1421806839L, 1424217068L, 1424217150L,
1424218395L)), class = "data.frame", row.names = c(NA, -10L))
CodePudding user response:
Consider some helper columns with several calls to ave
:
output <- within(
df, {
timedt <- as.POSIXct(timestart, origin="1970-01-01")
first <- ave(timedt, user, FUN=min)
hour_diff <- round(as.numeric(difftime(timedt, first, unit="hours")))
interval_1h <- ave(
ifelse(ave(hour_diff, user, hour_diff, FUN=seq_along) == 1, 1, 0),
user,
FUN=cumsum
)
rm(timedt, first, hour_diff)
}
)
output
user timestart interval_1h
1 1 1421286975 1
2 1 1421287343 1
3 1 1421470513 2
4 1 1421470513 2
5 1 1421471816 2
6 1 1421806839 3
7 2 1424217068 1
8 2 1424217150 1
9 2 1424218395 1