I have data with information about calls (about 3 million rows).
caller
user_1
user_2
user_3
user_N
I need to create one more column with a random timestamp for each user
call, i.e. I want to get something like this:
caller | timestamp |
---|---|
user_1 | 2019-12-24 21:00:07 |
user_2 | 2019-12-27 20:03:19 |
user_3 | 2020-01-11 19:30:54 |
user_N | 2020-02-15 22:37:12 |
Due to restrictions, the time can only be between 18:00:00 and 23:59:59, and the dates must be in the range from Jan 1, 2019 to Jan 1, 2021.
Is it possible to implement this in R? Perhaps there are some functions that can be useful here?
I would be grateful for any help!
CodePudding user response:
Given a data frame with IDs:
df <- data.frame(caller = 1:3E6)
You could run
# 731 whole-day offsets cover 2019-01-01 through 2020-12-31 (2020 is a leap year)
df$timestamp = as.POSIXct("2019-01-01 00:00", tz = "GMT") +
  floor(runif(nrow(df), max = 731))*24*60*60 +
  runif(nrow(df), min = 18*60*60, max = 24*60*60)
which would add a uniform random number of days, and a random number of seconds between 18 and 24 hours' worth.
We can verify that the timestamps occur in the desired range:
range(df$timestamp)
range(lubridate::hour(df$timestamp) + lubridate::minute(df$timestamp)/60)
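For a quick sanity check that does not depend on lubridate, the same idea can be tried on a small data frame. This is only a sketch: the 1,000-row size, the seed, the `user_` labels, and the 731-day span (2019-01-01 through 2020-12-31) are assumptions chosen for illustration, and it draws whole seconds with `sample` rather than fractional ones with `runif`.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility only
df <- data.frame(caller = paste0("user_", 1:1000))

origin  <- as.POSIXct("2019-01-01 00:00:00", tz = "GMT")
# Whole-day offsets 0..730, i.e. 2019-01-01 through 2020-12-31
day_off <- sample.int(731, nrow(df), replace = TRUE) - 1L
# Second-of-day offsets 64800..86399, i.e. 18:00:00 through 23:59:59
sec_off <- sample(18 * 3600 + 0:21599, nrow(df), replace = TRUE)

df$timestamp <- origin + day_off * 86400 + sec_off

# Inspect the time-of-day and date bounds
range(format(df$timestamp, "%H:%M:%S"))
range(as.Date(df$timestamp, tz = "GMT"))
```

Using integer offsets keeps every timestamp on a whole second, which matches the format shown in the question.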
CodePudding user response:
One approach to generating random timestamps in a range is to build a sequence of all possible timestamps in the range with the seq function, and then randomly select n of them with the sample function. For example, to generate 3 random timestamps between Jan 1, 2021 and Jan 3, 2021, at one-second resolution, you can do:
set.seed(1)
seq(as.POSIXct("2021-01-01 00:00:00") ,as.POSIXct("2021-01-03 23:59:59"), by = "s") |>
sample(3)
#[1] "2021-01-01 06:46:27 +07" "2021-01-03 04:56:32 +07"
#[3] "2021-01-02 10:33:32 +07"
Note: You can specify your own time zone with the tz argument of the as.POSIXct function.
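As a small illustration of that note, passing tz explicitly means the result no longer depends on the machine's local time zone. "UTC" here is just an example value, and the names `x` and `picked` are made up for this sketch.

```r
set.seed(1)
# Every second from 2021-01-01 00:00:00 to 2021-01-03 23:59:59, in UTC
x <- seq(as.POSIXct("2021-01-01 00:00:00", tz = "UTC"),
         as.POSIXct("2021-01-03 23:59:59", tz = "UTC"),
         by = "sec")
picked <- sample(x, 3)     # subsetting keeps the POSIXct class and time zone
attr(picked, "tzone")
```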
With this approach, you can get 3 million random timestamps by the following steps:
- Set the start and the end of the daily range to 18:00:00 and 23:59:59, respectively.
starts <- seq(as.POSIXct("2019-01-01 18:00:00"), as.POSIXct("2021-01-01 18:00:00"),
by = "days")
ends <- seq(as.POSIXct("2019-01-01 23:59:59"), as.POSIXct("2021-01-01 23:59:59"),
by = "days")
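- Calculate the number of samples for each day. Since 3e6 is not evenly divisible by the number of days, spread the remainder over the first few days so that the counts sum to exactly 3e6.
ndays <- length(starts)
n <- rep(3e6 %/% ndays, ndays)
extra <- 3e6 %% ndays
n[seq_len(extra)] <- n[seq_len(extra)] + 1
- Randomly select n[k] samples from all possible timestamps on each day, and then store the samples in a list.
sampled_timestamps <- vector("list", ndays)
for (k in seq_len(ndays)) {
  # by = "s" enumerates every second of the 18:00:00-23:59:59 window;
  # by = "hours" would yield only 6 candidates per day, too few to sample from
  sampled_timestamps[[k]] <- seq(starts[k], ends[k], by = "s") |>
    sample(n[k])
}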
- Convert sampled_timestamps to a vector so that it can be used as a column in a data frame.
v_sampled_timestamps <- do.call("c", sampled_timestamps)
Now you can use v_sampled_timestamps to fill in the values of the timestamps column in your data frame.
your_df$timestamps <- v_sampled_timestamps
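One thing to be aware of: do.call("c", ...) concatenates the per-day samples in calendar order, so assigning the vector directly gives the first rows the earliest dates. If that ordering is unwanted, the vector can be shuffled before assignment. The sketch below shows this on a small scale; `your_df`, its 1,000 rows, the one-week date range, and the "UTC" time zone are placeholders for illustration, not part of the original data.

```r
set.seed(1)
your_df <- data.frame(caller = paste0("user_", 1:1000))   # placeholder data

starts <- seq(as.POSIXct("2019-01-01 18:00:00", tz = "UTC"),
              as.POSIXct("2019-01-07 18:00:00", tz = "UTC"), by = "days")
ndays  <- length(starts)

# Split the rows across days so the counts sum to nrow(your_df) exactly
n <- rep(nrow(your_df) %/% ndays, ndays)
extra <- nrow(your_df) %% ndays
n[seq_len(extra)] <- n[seq_len(extra)] + 1

# For each day, draw n[k] distinct second offsets within 18:00:00..23:59:59
sampled <- lapply(seq_len(ndays), function(k) starts[k] + sample(0:21599, n[k]))
v <- do.call("c", sampled)

# Shuffle so that row order is unrelated to the date
your_df$timestamps <- sample(v)
```

Adding sampled second offsets to each day's start avoids materialising the full per-day sequence, but the result is the same kind of draw as sampling from seq(starts[k], ends[k], by = "s").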