Generating a Markov chain simulation using a transition matrix of a specific size and with a given seed

I am trying to generate a Markov chain simulation that starts from a specific sequence, using the mchmm library (built on SciPy and NumPy). I am not sure I am using it correctly, since the library also implements the Viterbi and Baum-Welch algorithms for hidden Markov models, which I am not familiar with.

To illustrate, I will continue with an example.

import mchmm as mc

data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'
a = mc.MarkovChain().from_data(data)

I want a Markov simulation based on a transition matrix of order 3, starting with the last 3 characters of the sequence above ("ABC"):

start_sequence = data[-3:]
tfm3 = a.n_order_matrix(a.observed_p_matrix, order=3)  # I want an order-3 transition matrix
ids, states = a.simulate(n=10, tf=tfm3, start=start_sequence)

This returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2552615/2308700615.py in <module>
----> 1 ids, states = a.simulate(n=10, tf=tfm3, start=start_sequence)

~/anaconda3/lib/python3.8/site-packages/mchmm/_mc.py in simulate(self, n, tf, states, start, ret, seed)
    304             _start = np.random.randint(0, len(states))
    305         elif isinstance(start, str):
--> 306             _start = np.argwhere(states == start).item()
    307 
    308         # simulated sequence init

ValueError: can only convert an array of size 1 to a Python scalar

I was expecting to get a sequence of 10 characters, starting from the string 'ABC' (data[-3:]), since I want to constrain the Markov simulation to start with the probabilities implied by that specific sequence, following a Markov chain of order 3.

Any feedback?

Answer:

The states in the MarkovChain instance a are 'A', 'B' and 'C'. When the simulate method is given a string for start, it expects it to be the name of one of the states, i.e. either 'A', 'B' or 'C'. You get that error because data[-3:] ('ABC') is not one of the states, so the lookup np.argwhere(states == start) shown in the traceback finds no match and .item() fails.
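To see the failure in isolation, here is a minimal sketch that reproduces only the lookup line from the traceback with plain NumPy. The helper name find_start_index is made up for illustration and is not part of mchmm:

```python
import numpy as np

def find_start_index(states, start):
    """Mimic the lookup from the traceback: position of `start` in `states`."""
    matches = np.argwhere(states == start)
    # .item() only succeeds when exactly one element matched.
    return matches.item()

states = np.array(['A', 'B', 'C'])

# A valid state name matches exactly once, so this prints its index.
print(find_start_index(states, 'A'))

try:
    # 'ABC' is not a state: nothing matches, the result is empty,
    # and .item() raises ValueError.
    find_start_index(states, 'ABC')
except ValueError as exc:
    print('ValueError:', exc)
```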

For example, in the following I use start='A' in the call of simulate(), and it generates a sequence of 10 states, starting at 'A':

In [26]: data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'

In [27]: a = mc.MarkovChain().from_data(data)

In [28]: tfm3 = a.n_order_matrix(a.observed_p_matrix, order=3)

In [29]: ids, states = a.simulate(n=10, tf=tfm3, start='A')

In [30]: states
Out[30]: array(['A', 'C', 'A', 'C', 'C', 'C', 'A', 'A', 'C', 'B'], dtype='<U1')
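One thing worth keeping in mind: a matrix over the three states 'A', 'B' and 'C' stays 3x3 whatever power it is raised to, so it can only ever describe a chain whose states are single symbols. The sketch below estimates the first-order matrix by hand with NumPy (not through mchmm) and cubes it. Whether n_order_matrix does exactly this is an assumption on my part that I have not verified against the mchmm source:

```python
import numpy as np

data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'

symbols = sorted(set(data))                      # the single-symbol states
index = {s: i for i, s in enumerate(symbols)}

# Count first-order transitions between consecutive symbols.
counts = np.zeros((len(symbols), len(symbols)))
for cur, nxt in zip(data, data[1:]):
    counts[index[cur], index[nxt]] += 1

# Normalise each row to get the first-order transition probabilities.
p1 = counts / counts.sum(axis=1, keepdims=True)

# Cubing gives 3-step probabilities between single symbols
# (assumed, not verified, to be what n_order_matrix computes).
p3 = np.linalg.matrix_power(p1, 3)

print(p3.shape)         # same shape as p1: the states are still single symbols
print(p3.sum(axis=1))   # each row is still a probability distribution
```

Under that assumption, tfm3 describes where the chain is three steps later, which is a different thing from conditioning on the previous three symbols.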

If you are trying to create a Markov chain where the states are sequences of three symbols (to add "history" that includes the previous two states), you could create a new input to .from_data() that consists of the length-3 overlapping subsequences of data (also known as the 3-grams). For example,

In [65]: data3 = [data[k:k+3] for k in range(len(data)-2)]

In [66]: data3[:4]
Out[66]: ['AAB', 'ABC', 'BCA', 'CAB']

In [67]: data3[-8:]
Out[67]: ['CAB', 'ABA', 'BAB', 'ABC', 'BCB', 'CBA', 'BAB', 'ABC']

In [68]: a = mc.MarkovChain().from_data(data3)

Take a look at the states of this Markov chain:

In [69]: a.states
Out[69]: 
array(['AAA', 'AAB', 'AAC', 'ABA', 'ABC', 'ACA', 'ACB', 'BAA', 'BAB',
       'BAC', 'BBA', 'BBC', 'BCA', 'BCB', 'BCC', 'CAB', 'CBA', 'CBB',
       'CBC', 'CCA', 'CCB', 'CCC'], dtype='<U3')
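Because consecutive 3-grams overlap in two symbols, each state can only move to a state that begins with its own last two symbols, so most entries of the transition matrix are zero. A small sketch using only the standard library (it counts the transitions by hand and does not rely on mchmm) makes that structure visible:

```python
from collections import Counter, defaultdict

data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'
data3 = [data[k:k + 3] for k in range(len(data) - 2)]

# Count observed transitions between consecutive 3-gram states.
transitions = defaultdict(Counter)
for cur, nxt in zip(data3, data3[1:]):
    transitions[cur][nxt] += 1

# Every observed transition keeps the last two symbols of the
# current state and appends one new symbol.
overlap_ok = all(cur[1:] == nxt[:2]
                 for cur, successors in transitions.items()
                 for nxt in successors)
print(overlap_ok)

# Observed successors of the starting state and how often each occurred.
print(dict(transitions['ABC']))
```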

Simulate a sequence of 10 states, starting with the last state in data3:

In [70]: ids, states = a.simulate(n=10, start=data3[-1])

In [71]: states
Out[71]: 
array(['ABC', 'BCA', 'CAB', 'ABC', 'BCB', 'CBC', 'BCB', 'CBA', 'BAA',
       'AAB'], dtype='<U3')

Collapse each state to its final character to get back to the form of the original data. Note that the first character, 'C', comes from the starting state 'ABC', so the result contains 9 newly simulated symbols:

In [72]: ''.join([state[-1] for state in states])
Out[72]: 'CABCBCBAAB'
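If you also want the run to be reproducible with a seed, the whole procedure can be sketched without mchmm, using only the standard library. This is an illustration of the 3-gram idea under my own assumptions, not the library's implementation: the function name simulate_order_k and its parameters are made up, and unlike the session above it returns n new symbols and leaves out the starting state.

```python
import random
from collections import Counter, defaultdict

def simulate_order_k(data, k, n, seed=None):
    """Return n simulated symbols continuing `data`, using an
    order-k chain estimated from the k-grams of `data`."""
    rng = random.Random(seed)   # local generator, so the seed is reproducible

    # Overlapping k-grams are the states of the higher-order chain.
    grams = [data[i:i + k] for i in range(len(data) - k + 1)]
    transitions = defaultdict(Counter)
    for cur, nxt in zip(grams, grams[1:]):
        transitions[cur][nxt] += 1

    state = grams[-1]           # start from the last k symbols of data
    out = []
    for _ in range(n):
        successors = transitions.get(state)
        if not successors:
            break               # this k-gram was never followed by anything
        choices = list(successors)
        weights = [successors[c] for c in choices]
        state = rng.choices(choices, weights=weights)[0]
        out.append(state[-1])   # keep only the newly appended symbol
    return ''.join(out)

data = 'AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC'
print(simulate_order_k(data, k=3, n=10, seed=42))
```

Calling it twice with the same seed gives the same string, and every window of four consecutive symbols in data[-3:] plus the output also occurs somewhere in data, because only observed transitions can be drawn.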