I have a pandas snippet that takes a day column and adds a random time (hour:minute:second) to each day:
pd.to_datetime(d['day']) + pd.to_timedelta(np.random.randint(0, 24*3600, size=len(d)), unit='s')
Example output: 1/1/2021 19:00:22, 1/1/2021 3:21:34
This works well and generates random datetimes on a given day. What I want, however, is for more of the random timestamps to fall between two times, in my case between 9:00 AM and 7:00 PM. Times outside that range should still occur, just less often.
CodePudding user response:
Use np.random.choice and give each second of the day its own probability.
# individual probability inside and outside the range 7-19
p_in = 0.8 / ((19-7)*3600)
p_out = 0.2 / (24*3600 - (19-7)*3600)
# array of probabilities
p = np.full(24*3600, p_out)
p[7*3600:19*3600] = p_in
# seconds in a day
t = np.arange(0, 24*3600)
>>> df['day'] + pd.to_timedelta(np.random.choice(t, len(df), p=p), unit='s')
0 2021-11-03 18:18:30
1 2021-11-03 22:25:47
2 2021-11-03 15:04:09
3 2021-11-04 01:08:31
4 2021-11-03 17:51:53
...
117 2021-11-04 15:05:33
118 2021-11-04 07:12:58
119 2021-11-04 09:09:38
120 2021-11-05 00:17:58
121 2021-11-04 23:53:20
Length: 122, dtype: datetime64[ns]
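You can adjust the probabilities (0.8 / 0.2) and the window (7-19 here; the question's 9:00 AM start would use 9 instead of 7) according to your needs. np.random.choice requires p to sum to 1, up to floating-point tolerance: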
>>> np.sum(p)
0.9999999999999999
>>> np.isclose(np.sum(p), 1)
True
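The same idea can be wrapped in a small helper so that the window and the weight become parameters. This is only a sketch: the name weighted_seconds, its arguments, and the 'timestamp' column are made up for illustration, and the defaults assume the 9:00-19:00 window from the question rather than the 7-19 window used above.

```python
import numpy as np
import pandas as pd

def weighted_seconds(n, start_hour=9, end_hour=19, p_inside=0.8, rng=None):
    """Draw n second-of-day offsets; a total probability of p_inside
    is spread evenly over [start_hour, end_hour), the rest over the other seconds."""
    rng = np.random.default_rng() if rng is None else rng
    day = 24 * 3600
    lo, hi = start_hour * 3600, end_hour * 3600
    # per-second probability outside the window, then inside it
    p = np.full(day, (1 - p_inside) / (day - (hi - lo)))
    p[lo:hi] = p_inside / (hi - lo)
    p /= p.sum()  # renormalize to guard against floating-point drift
    return rng.choice(day, size=n, p=p)

df = pd.DataFrame({'day': pd.date_range("2021-01-01", "2021-01-31", freq='D')})
offsets = weighted_seconds(len(df), rng=np.random.default_rng(0))  # seeded for repeatability
df['timestamp'] = df['day'] + pd.to_timedelta(offsets, unit='s')
```

Passing a seeded generator is optional; without it every call produces different timestamps, as in the original snippet.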
Demo
df = pd.DataFrame({'day': pd.date_range("2021-01-01", "2021-01-31", freq='D')})
df['day'] + pd.to_timedelta(np.random.choice(t, len(df), p=p), unit='s')
# Output:
0 2021-01-01 08:03:53
1 2021-01-02 02:48:28 # outside
2 2021-01-03 06:37:24 # outside
3 2021-01-04 18:15:01
4 2021-01-05 10:36:53
5 2021-01-06 06:41:23 # outside
6 2021-01-07 10:33:09
7 2021-01-08 13:23:46
8 2021-01-09 08:47:57
9 2021-01-10 07:37:35
10 2021-01-11 04:57:13 # outside
11 2021-01-12 17:01:39
12 2021-01-13 13:58:16
13 2021-01-14 08:57:05
14 2021-01-15 08:04:10
15 2021-01-16 20:07:45 # outside
16 2021-01-17 02:42:26 # outside
17 2021-01-18 17:10:00
18 2021-01-19 08:22:52
19 2021-01-20 18:07:02
20 2021-01-21 14:40:18
21 2021-01-22 08:39:55
22 2021-01-23 18:54:33
23 2021-01-24 06:39:38 # outside
24 2021-01-25 14:41:48
25 2021-01-26 07:54:33
26 2021-01-27 05:34:36 # outside
27 2021-01-28 18:55:51
28 2021-01-29 09:37:26
29 2021-01-30 22:07:28 # outside
30 2021-01-31 10:39:51
dtype: datetime64[ns]
Here, 9 values fall outside the 7-19 range and 22 inside, so the observed split is 0.290 / 0.710 (= 1.0). With only 31 samples, that is reasonably close to the target probabilities of 0.2 / 0.8.
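A 31-row sample is small, so the observed split can drift noticeably from the target. As a rough check, the sketch below draws many timestamps for a single day and measures the share that lands in the 7-19 window. The sample size, the seed, and the names stamps and frac_inside are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only to make the sketch reproducible

# same per-second probabilities as above: 0.8 inside 7-19, 0.2 outside
p = np.full(24 * 3600, 0.2 / (24 * 3600 - (19 - 7) * 3600))
p[7 * 3600:19 * 3600] = 0.8 / ((19 - 7) * 3600)

# one day repeated many times, each with its own random offset in seconds
days = pd.Series(pd.to_datetime(["2021-01-01"] * 100_000))
stamps = days + pd.to_timedelta(rng.choice(24 * 3600, size=len(days), p=p), unit='s')

# hours 7..18 cover 07:00:00 through 18:59:59
inside = stamps.dt.hour.between(7, 18)
frac_inside = inside.mean()
print(round(frac_inside, 3))
```

With a sample this large, the printed fraction should sit very close to 0.8.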