Why are my randomly picked datetimes not equally distributed?

Time:10-08

I found something odd in my randomly generated timestamps. I have an application that generates artificial log data, and I would like to be able to define the time range it covers. So I wrote code like this:

# imports
from datetime import datetime
import time
from random import choice

timestamps = []
timerange_in_days = 14 # how many days back from today should my timestamps cover?
entries = 10000 # how many timestamps?

for _ in range(entries):
    
    last_midnight = (int(time.time() // 86400)) * 86400  # find date border
    days = range(1, timerange_in_days + 1)  # set the range
    timestamp = last_midnight - (choice(days) * choice(range(1, 25)) * 3600)  # create the timestamp
    timestamp = datetime.fromtimestamp(timestamp).isoformat(timespec='milliseconds')  # format it
    timestamps.append(timestamp)

I then wrote this to a file and plotted it in R, as I couldn't get it visualized quickly in Python. I plotted one histogram by day and one by hour. The little bar for October 8 comes from the timezone not being adjusted, meaning the data runs until 2 am of the next day.


with open(r'/path/to/file/dates.txt', 'w') as myfile:
    for item in timestamps:
        myfile.write("%s\n" % item)

# in R
path <- "path/to/file"
dates <- data.table::fread(file.path(path, "dates.txt"))  # recognizes as POSIXct automatically
hist(dates$V1, "days")

[Image: histogram of the timestamps by day]

hist(dates$V1, "hours")

[Image: histogram of the timestamps by hour]

But my question is: why are the timestamps more frequent around "now"? I want them to be spread equally across the days.

CodePudding user response:

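Rethink your logic. choice(days) * choice(range(1, 25)) picks a random number of days and then multiplies it by a random number of hours between 1 and 24. The offset is therefore days * hours (in hours), not something like days * 24 + hours. The hour factor averages about 12, and a product of two random numbers is small far more often than it is large, so most of the timestamps end up much closer to last_midnight.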
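To see the skew without sampling, here is a small sketch that enumerates every (day, hour) pair the original code can draw and counts how many land in each 24-hour bucket before last_midnight. The names offsets_in_hours and per_day are only illustrative; the ranges are the ones from the question.

```python
from collections import Counter

timerange_in_days = 14

# Every (day, hour) pair is equally likely in the original code,
# so listing all products gives the exact distribution of offsets.
offsets_in_hours = [
    d * h
    for d in range(1, timerange_in_days + 1)
    for h in range(1, 25)
]

# Bucket each offset by how many whole days before last_midnight it falls:
# 1..24 hours -> bucket 0, 25..48 hours -> bucket 1, and so on.
per_day = Counter((offset - 1) // 24 for offset in offsets_in_hours)

for day in range(timerange_in_days):
    share = per_day[day] / len(offsets_in_hours)
    print(f"{day + 1:2d} day(s) back: {share:6.1%}")
```

About 22% of the pairs fall within the first day back, where an even spread over 14 days would give about 7% per day, and the oldest day gets less than 1%.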

A much better approach is

from random import random
timestamp = last_midnight - (random() * timerange_in_days * 24 * 3600)  # create the timestamp

Since random() returns a float in the range [0.0, 1.0), the offset is spread uniformly over the whole window, from last_midnight back to timerange_in_days days before it.

Also, you don't need to calculate last_midnight inside the loop, just do it once before entering the loop.
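Putting both suggestions together, a minimal sketch of the whole generator might look like this. It keeps the variable names from the question; range_in_seconds is a name introduced here for illustration. As in the question, last_midnight is midnight UTC, so the local-time output can still be shifted by the timezone offset.

```python
from datetime import datetime
import time
from random import random

timerange_in_days = 14  # how many days back the timestamps should cover
entries = 10000         # how many timestamps to generate

# Computed once, before the loop (midnight UTC, as in the question).
last_midnight = int(time.time() // 86400) * 86400
range_in_seconds = timerange_in_days * 24 * 3600

timestamps = []
for _ in range(entries):
    # random() is uniform on [0.0, 1.0), so the offset is uniform over the window.
    offset = random() * range_in_seconds
    timestamp = datetime.fromtimestamp(last_midnight - offset)
    timestamps.append(timestamp.isoformat(timespec='milliseconds'))
```

With this version, a histogram by day should show bars of roughly equal height, up to random sampling noise.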
