Q-table representation for nested lists as states and tuples as actions


How can I create a Q-table when my states are lists and my actions are tuples?

Example of states for N = 3

[[1], [2], [3]]
[[1], [2, 3]]
[[1], [3, 2]]
[[2], [3, 1]]
[[1, 2, 3]]

Example of actions for those states

[[1], [2], [3]] -> (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)
[[1], [2, 3]] -> (1, 2), (2, 0), (2, 1)
[[1], [3, 2]] -> (1, 3), (3, 0), (3, 1)
[[2], [3, 1]] -> (2, 3), (3, 0), (3, 2)
[[1, 2, 3]] -> (1, 0)

I was wondering about

# q_table = {state: {action: q_value}}

But I don't think that's a good design.

CodePudding user response:

1. Should your states really be of type list?

list is a mutable type. tuple is the equivalent immutable type. Do you mutate your states during learning? I doubt it.

In any case, if you use list, you cannot use it as a dictionary key, because it is mutable and therefore not hashable.
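
For example, a small helper (the name to_key is just illustrative) could convert a nested-list state into a tuple of tuples before using it as a key:

# Turn a nested-list state into a hashable tuple of tuples.
def to_key(state):
    # [[1], [2, 3]] -> ((1,), (2, 3))
    return tuple(tuple(group) for group in state)

q_table = {}
q_table[to_key([[1], [2, 3]])] = {(1, 2): 0.0, (2, 0): 0.0, (2, 1): 0.0}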

2. Otherwise this is a pretty good representation

In a reinforcement learning context, you’ll want to:

  1. Get the Q value for a specific state-action pair
  2. Look at the Q values of all possible actions in a specific state (to find the maximal Q)

Your representation allows you to do both of these with minimal complexity, and is pretty clear. So it is a good representation.
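
For instance, with the nested-dict layout q_table[state][action] and illustrative values (assuming states are stored as hashable tuples of tuples), both operations are one-liners:

q_table = {((1,), (2, 3)): {(1, 2): 0.5, (2, 0): 0.1, (2, 1): 0.3}}
state = ((1,), (2, 3))

q_sa = q_table[state][(1, 2)]                               # 1. a specific Q value
best_action = max(q_table[state], key=q_table[state].get)   # 2. argmax over actions
best_value = q_table[state][best_action]                    #    the maximal Q value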

CodePudding user response:

Using a nested dictionary is actually a reasonable design choice for custom tabular reinforcement learning; it's called tabular for a reason :)

You could use a defaultdict to initialize the Q-table to a default value, e.g., 0.

from collections import defaultdict

default_q_value = 0.0  # initial Q value for every unseen state-action pair

q = defaultdict(lambda: defaultdict(lambda: default_q_value))

or without defaultdict:

q = {s: {a: default_q_value for a in actions} for s in states}
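
If the set of legal actions differs per state, as in the example from the question, a variation on the same pattern (a sketch, assuming states are stored as tuples of tuples) builds the table from a state-to-actions mapping:

default_q_value = 0.0

# Hypothetical mapping from each (hashable) state to its legal actions,
# taken from the N = 3 example in the question.
actions_per_state = {
    ((1,), (2,), (3,)): [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)],
    ((1,), (2, 3)):     [(1, 2), (2, 0), (2, 1)],
    ((1, 2, 3),):       [(1, 0)],
}

q = {s: {a: default_q_value for a in acts} for s, acts in actions_per_state.items()}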

It is then convenient to perform the Q-learning update, taking the max over the next state's actions, with something like:

best_next_state_val = max(q[next_state].values())
q[state][action] += alpha * (reward + gamma * best_next_state_val - q[state][action])

One thing I'd just watch out for is that if you train an agent using a Q-table like this and pick actions with a plain max, it will pick the same action every time whenever all the values for a state's actions are equal (such as right after the Q-table is initialized).
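
One way around that, sketched here rather than taken from the answer above, is to break ties randomly among the best actions:

import random

# Greedy selection that breaks ties at random, so a freshly initialized
# table does not always yield the same action.
def greedy_action(q, state):
    action_values = q[state]
    best = max(action_values.values())
    best_actions = [a for a, v in action_values.items() if v == best]
    return random.choice(best_actions)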

Finally, if you don't want to use dictionaries for the Q-table itself, you can map state and action tuples to indices, store that mapping in a dictionary, and do a lookup whenever you pass a state/action to your environment implementation. You can then use the indices directly into a 2D NumPy array.
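
As a rough sketch of that approach (the enumerations below are just the N = 3 example from the question, with states stored as tuples of tuples):

import numpy as np

# Enumerate all states and all actions once, then index into a 2D array.
states = [((1,), (2,), (3,)), ((1,), (2, 3)), ((1,), (3, 2)), ((2,), (3, 1)), ((1, 2, 3),)]
actions = [(1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (3, 0), (3, 1), (3, 2), (1, 0)]

state_index = {s: i for i, s in enumerate(states)}
action_index = {a: i for i, a in enumerate(actions)}

# Q-table indexed by [state, action]; entries for illegal pairs simply stay at 0.
q = np.zeros((len(states), len(actions)))
q[state_index[((1,), (2, 3))], action_index[(2, 1)]] = 0.5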
