I am trying to generate a Markov simulation that starts from a specific sequence, using the mchmm library (built on SciPy and NumPy). I am not sure I am using it correctly, since the library also implements the Viterbi and Baum-Welch algorithms for Markov models, which I am not familiar with.
To illustrate, here is an example.
import mchmm as mc
data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'
a = mc.MarkovChain().from_data(data)
I want a Markov simulation based on an order-3 transition matrix, starting from the last 3 characters of the sequence above ("ABC"):
start_sequence = data[-3:]
tfm3 = a.n_order_matrix(a.observed_p_matrix, order=3) #this is because I want an order 3 transition matrix
ids, states = a.simulate(n=10, tf=tfm3, start=start_sequence)
This returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_2552615/2308700615.py in <module>
----> 1 ids, states = a.simulate(n=10, tf=tfm3, start=start_sequence)
~/anaconda3/lib/python3.8/site-packages/mchmm/_mc.py in simulate(self, n, tf, states, start, ret, seed)
304 _start = np.random.randint(0, len(states))
305 elif isinstance(start, str):
--> 306 _start = np.argwhere(states == start).item()
307
308 # simulated sequence init
ValueError: can only convert an array of size 1 to a Python scalar
I was expecting to get a sequence of 10 characters, starting with the string 'ABC' (data[-3:]), since I want to constrain the Markov simulation to start with the probabilities implied by that specific sequence, following a Markov chain of order 3.
Any feedback?
CodePudding user response:
The states in the MarkovChain instance a are 'A', 'B' and 'C'. When the simulate method is given a string for start, it expects it to be the name of one of the states, i.e. either 'A', 'B' or 'C'. You get that error because data[-3:] is not one of the states, so np.argwhere(states == start) finds no match and .item() fails on the resulting empty array.
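The failing line in the traceback can be reproduced with NumPy alone. The following is a minimal sketch, not mchmm's actual code: the helper name find_start_index is hypothetical, and it only mirrors the np.argwhere(...).item() lookup shown at line 306 of the traceback, with an explicit check added so the error message is clearer.

```python
import numpy as np

# Same state labels that from_data() derives from the single-character input.
states = np.array(['A', 'B', 'C'])


def find_start_index(states, start):
    """Hypothetical helper mirroring the lookup in the traceback."""
    matches = np.argwhere(states == start)
    # .item() only works on arrays with exactly one element, so check first.
    if matches.size != 1:
        raise ValueError(
            f"{start!r} is not one of the states {states.tolist()}"
        )
    return matches.item()


print(find_start_index(states, 'A'))

try:
    find_start_index(states, 'ABC')
except ValueError as err:
    print("lookup failed:", err)
```

With 'ABC' the comparison matches nothing, which is the situation that makes the bare .item() call raise.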
For example, in the following I use start='A' in the call of simulate(), and it generates a sequence of 10 states, starting at 'A':
In [26]: data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'
In [27]: a = mc.MarkovChain().from_data(data)
In [28]: tfm3 = a.n_order_matrix(a.observed_p_matrix, order=3)
In [29]: ids, states = a.simulate(n=10, tf=tfm3, start='A')
In [30]: states
Out[30]: array(['A', 'C', 'A', 'C', 'C', 'C', 'A', 'A', 'C', 'B'], dtype='<U1')
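One caveat, stated as an assumption rather than a documented fact about mchmm: the simulation above still moves between the single-symbol states 'A', 'B' and 'C', so tfm3 is a 3x3 matrix. If n_order_matrix raises the observed matrix to the given power (worth confirming in the source of your installed version), then it holds three-step transition probabilities, not a chain that remembers the previous three symbols. The sketch below uses a made-up toy matrix and plain NumPy to show what such a matrix power looks like.

```python
import numpy as np

# Toy first-order transition matrix over A, B, C.
# The numbers are invented for illustration; each row sums to 1.
P = np.array([
    [0.2, 0.5, 0.3],
    [0.4, 0.1, 0.5],
    [0.3, 0.6, 0.1],
])

# Third matrix power: entry [i, j] is the probability of being in
# state j exactly three steps after being in state i.
P3 = np.linalg.matrix_power(P, 3)

print(P3.shape)        # still one row and one column per single-symbol state
print(P3.sum(axis=1))  # rows remain probability distributions
```

The matrix keeps its shape, so it cannot encode a start such as 'ABC'; that requires enlarging the state space.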
If you are trying to create a Markov chain where the states are sequences of three symbols (to add "history" that includes the previous two states), you could create a new input to .from_data() that consists of the length-3 overlapping subsequences of data (also known as the 3-grams). For example,
In [65]: data3 = [data[k:k+3] for k in range(len(data)-2)]
In [66]: data3[:4]
Out[66]: ['AAB', 'ABC', 'BCA', 'CAB']
In [67]: data3[-8:]
Out[67]: ['CAB', 'ABA', 'BAB', 'ABC', 'BCB', 'CBA', 'BAB', 'ABC']
In [68]: a = mc.MarkovChain().from_data(data3)
Take a look at the states of this Markov chain:
In [69]: a.states
Out[69]:
array(['AAA', 'AAB', 'AAC', 'ABA', 'ABC', 'ACA', 'ACB', 'BAA', 'BAB',
'BAC', 'BBA', 'BBC', 'BCA', 'BCB', 'BCC', 'CAB', 'CBA', 'CBB',
'CBC', 'CCA', 'CCB', 'CCC'], dtype='<U3')
Simulate a sequence of 10 states, starting with the last state in data3:
In [70]: ids, states = a.simulate(n=10, start=data3[-1])
In [71]: states
Out[71]:
array(['ABC', 'BCA', 'CAB', 'ABC', 'BCB', 'CBC', 'BCB', 'CBA', 'BAA',
'AAB'], dtype='<U3')
Compress the states to only include the final single character, so it is back in the form of the original data input:
In [72]: ''.join([state[-1] for state in states])
Out[72]: 'CABCBCBAAB'
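For comparison, the same idea can be sketched without mchmm, using only the standard library. This is an illustration of the 3-gram approach under simple assumptions (transition probabilities estimated from raw counts), not a reimplementation of the library; the function names ngrams, fit_transitions and simulate_chain are hypothetical.

```python
import random
from collections import Counter, defaultdict

data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'


def ngrams(seq, n=3):
    """Overlapping length-n subsequences of seq."""
    return [seq[k:k + n] for k in range(len(seq) - n + 1)]


def fit_transitions(states):
    """Count how often each state is followed by each other state."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return counts


def simulate_chain(counts, start, n, seed=None):
    """Return up to n states, beginning with start."""
    rng = random.Random(seed)
    path = [start]
    while len(path) < n:
        options = counts.get(path[-1])
        if not options:
            break  # state never seen with a successor
        following = rng.choices(
            list(options), weights=list(options.values())
        )[0]
        path.append(following)
    return path


grams = ngrams(data, 3)
counts = fit_transitions(grams)
path = simulate_chain(counts, start=data[-3:], n=10, seed=0)

# Keep the full start state, then one new character per later state.
print(path[0] + ''.join(state[-1] for state in path[1:]))
```

Because consecutive 3-grams overlap by two characters, every simulated step appends exactly one new symbol that is conditioned on the previous three.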