How would I generate a random data series in Python with events?-CodePudding

I'm trying to generate a random data series (or a time series) for anomaly detection, with events spanning a few consecutive data points. They could be values above/below a certain threshold, or anomaly types with different known probabilities.

e.g. in a case where 1 is normal and event types are within [2, 3, 4]: 11112221113333111111112211111

I looked through the np.random and random methods, but couldn't find any that generate these events. My current solution is picking random points, adding random durations to them to generate event start and end positions, labeling each event with a random event type, and joining back to the dataset, something like:

import numpy as np
num_events = np.random.randint(1, 10)
number_series = [1]*60
first_pos = 0 
event_starts = sorted([first_pos   i for i in np.random.randint(50, size = num_events)])
event_ends = [sum(i) for i in list(zip(event_starts, np.random.randint(8, size = num_events)))]
for c in list(zip(event_starts, event_ends)):
    rand_event_type  = np.random.choice(a = [2, 3, 4], p = [0.5, 0.3, 0.2])
    number_series[c[0]:c[1]] = [rand_event_type]*len(number_series[c[0]:c[1]])
print(number_series)

[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

But I'm wondering if there is a simpler way to just generate a series of numbers with events, based on a set of probabilities.

CodePudding user response：

It all depends on how you model your process (the underlying process you want to simulate). You can read more about some of the usual models on Wikipedia.

Simplest

In the following, we use a very simple model (slightly different than yours): events each have a probability (like in your question) and a random duration that is independent of the event itself. 1 ("normal") is an event like any others (unlike your sample code). We could change that, but right now this is one of the simplest models you can think of.

def gen_events(n):
    events = np.random.choice(a=[1, 2, 3, 4], p=[0.6, 0.2, 0.12, 0.08], size=n)
    durations = np.random.randint(1, 8, size=n)
    return np.repeat(events, durations)

np.random.seed(0)  # repeatable example
number_series = gen_events(10)  # for example

>>> number_series
array([1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 2, 1, 1, 1, 1, 1, 1, 3, 4, 4, 1, 1, 1, 1, 1])

Note, this is very fast:

%timeit gen_events(1_000_000)
# 44.9 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Markov chain

Another model (easier to parameterize, a bit more complex to implement) would be a Markov model. The simplest of them would be a Markov chain. Here is a super simple (but not very efficient) version:

def markov_chain(P, n, initial_state=0):
    m = P.shape[0]
    ix = np.arange(m)
    s = np.empty(n, dtype=int)
    s[0] = initial_state
    for i in range(1, n):
        s[i] = np.random.choice(ix, p=P[s[i-1]])
    return s

Above, P is a transition matrix, where each cell P[i,j] is the probability to transition from state i to state j. Here is an example application:

P = np.array([
    [.7, .1, .12, .08],  # from 0 to others
    [.3, .6, .05, .05],
    [.3, 0, .65, .05],
    [.4, 0, .05, .55],
])

np.random.seed(0)
n = 100
s = markov_chain(P, n)   1
>>> s
array([1, 1, 2, 2, 2, 2, 2, 2, 2, 4, 1, 2, 2, 2, 3, 1, 1, 1, 3, 3, 3, 4,
       4, 4, 4, 1, 1, 1, 4, 4, 3, 1, 2, 2, 2, 1, 1, 1, 1, 4, 4, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 3, 1, 3, 1, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 4, 1, 1, 1, 2, 1, 1, 1, 1, 3])

Note that the unigram probability of each event is called pi and corresponds to any of the rows of lim_{k -> \infty} P**k:

>>> pd.Series(markov_chain(P, 1000, 0)).value_counts(normalize=True).sort_index()
0    0.530
1    0.135
2    0.209
3    0.126

>>> np.linalg.matrix_power(P, 40)[0]
array([0.52188552, 0.13047138, 0.21632997, 0.13131313])

CodePudding user response：

A less verbose way would be to generate your list of events on the go.

Set, for example, a probability for an occurrence of an anomaly (say, 5%). Then,

events = []
for i in range(60):
  if random() <= 0.95:
    events.append(1)
  else:
    events.extend([choice(a = [2, 3, 4], p = [0.5, 0.3, 0.2])] * randint(8))

CodePudding user response：

You can generate random numbers from a uniform distribution over [0, 1) and use numpy.select and select which number will be 1, 2, 3, 4 like below:

import numpy as np
def generate_random_data_series(num, prob=[0.6,0.2,0.05,0.15]):
    x = np.random.rand(num)
    prob = np.cumsum(np.asarray(prob))
    condlist = [
        x < prob[0], 
        x < prob[1], 
        x < prob[2], 
        x < prob[3]
    ]
    choicelist = [1,2,3,4]
    return np.select(condlist, choicelist, default=1)

Benchmark on colab:

%timeit generate_random_data_series(1_000_000)
# 25.1 ms per loop (10 loops, best of 5)

Test function:

>>> from collections import Counter
>>> res = generate_random_data_series(100)
>>> res
array([1, 1, 4, 1, 4, 1, 1, 1, 4, 1, 3, 4, 4, 1, 1, 1, 1, 4, 1, 1, 2, 1,
       4, 1, 1, 1, 1, 1, 2, 1, 1, 4, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2,
       1, 2, 2, 1, 1, 4, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 1, 2, 1, 4, 1,
       1, 1, 2, 1, 1, 1, 1, 4, 1, 4, 2, 4, 4, 4, 2, 3, 2, 2, 2, 2, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1])
>>> Counter(res)
Counter({1: 61, 4: 15, 3: 3, 2: 21})
# prob  1 : 60%
# count 1 : 61 in 100 random number
# prob  2 : 20%
# count 2 : 21 in 100 random number
# prob  3 : 5%
# count 3 : 3 in 100 random number
# prob  4 : 15%
# count 4 : 15 in 100 random number