[Repost] An introduction to reinforcement learning: intraday timing based on the Q-learning algorithm

Time:09-26

The data used in this article comes from the JQData local quantitative finance database. Below I will walk through a rough but simple example of applying reinforcement learning to the securities market.
We will not spend much time on the theory and development history of reinforcement learning; introductory material is easy to find online, and even a brief overview will usually do the trick for beginners. So far there is no widely acclaimed Chinese textbook on the subject, and most of the good resources are in English, for example Richard S. Sutton and Andrew G. Barto's "Reinforcement Learning: An Introduction". It is an excellent textbook, and if you want to study reinforcement learning systematically and in depth it is well worth reading. There is also plenty of domestic academic work on reinforcement learning, but it tends to be quite specialised, and I do not recommend that beginners start by chewing through theory. The best way to get started is to first build an intuition for what reinforcement learning is, then apply some simple algorithm to model a problem you are trying to solve, and keep improving the algorithm and deepening your study along the way. For this article, we assume you already have some basic knowledge of reinforcement learning; here we only give a very simple demo of applying it to quantitative analysis.
As practitioners in quantitative analysis, we may not be interested in demos that use reinforcement learning to play games or hunt for treasure. What we would like is a simple reinforcement learning demo of our own: feed in K-line data and have it tell us when to buy and when to sell. Even if the buy and sell points are not accurate, we can at least see how the reinforcement learning model picks them. This article builds such a demo, mainly to show how to set up a simple reinforcement learning model for the securities market.
Compared with common machine learning algorithms such as neural networks, reinforcement learning is more flexible. The internals of deep neural networks and convolutional neural networks are hard, but to apply them you basically only need to understand the inputs and outputs. Reinforcement learning is different: you must abstract and model the characteristics of your problem yourself, and this is usually the hardest part. How do you abstract the various states from a pile of stock data, how do those states transition, how do you define the actions and the rewards? These questions directly determine the quality of your model.
The problem we want to solve in this article is: using a day's 48 five-minute K lines, decide at the end of every 5 minutes whether to buy (B), sell (S) or wait (W), and train the model on all the 5-minute data over a period of time to see which times of day are best for buying and which are best for selling. We take the time of day as the state id, so the state (S) transitions are easy to define: 935 (9:35 am, when the first K line has been produced) -> 940 -> 945 -> ... -> 1455 -> 1500. For each transition s -> s' the available actions (A) are B, S and W. We use the Q-learning algorithm to solve the problem, so the Q table should look something like this:
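Concretely, the rows of the table are the time states and the columns are the three actions; before any training, every entry starts at 0, roughly like this:

state (time)     B      S      W
        935    0.0    0.0    0.0
        940    0.0    0.0    0.0
        ...    ...    ...    ...
       1455    0.0    0.0    0.0
       1500    0.0    0.0    0.0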
With regard to the reward, we define it like this: the return over some period in the future, for example the price change over the next 3 K lines. With these pieces in place, we can basically start writing the program.
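As a concrete illustration, the R column that the environment below reads could be pre-computed from the 5-minute close prices roughly like this. This is only a minimal sketch: the column name 'close', the helper name add_forward_return and the 3-bar horizon are assumptions, not part of the original demo, and the last bars of the day have no complete forward window.

import pandas as pd

def add_forward_return(day_bars, horizon=3):
    # day_bars: one trading day of 5-minute K lines with a 'close' column,
    # ordered from 935 to 1500 (48 rows)
    df = day_bars.copy()
    # return over the next `horizon` bars: close[t + horizon] / close[t] - 1
    df['R'] = df['close'].shift(-horizon) / df['close'] - 1
    # the last bars of the day have no complete forward window; treat them as 0
    df['R'] = df['R'].fillna(0)
    return df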
First, create an environment class:

times = [935, 940, 945, 950, 955, 1000, 1005, 1010, 1015,
         1020, 1025, 1030, 1035, 1040, 1045, 1050, 1055,
         1100, 1105, 1110, 1115, 1120, 1125, 1130, 1305,
         1310, 1315, 1320, 1325, 1330, 1335, 1340, 1345,
         1350, 1355, 1400, 1405, 1410, 1415, 1420, 1425,
         1430, 1435, 1440, 1445, 1450, 1455, 1500]


class Market:
    def __init__(self, data):
        self.action_space = ['B', 'S', 'W']  # buy, sell, wait
        self.n_actions = len(self.action_space)
        self.data = data  # the 48 five-minute K lines from 935 to 1500
        self.time = 935

    def step(self, action):
        # We know which time-point state we are in; use that point's R (return)
        # as the reward for the action taken now.
        tix = times.index(self.time)
        nix = tix + 1
        if self.time == 1500:
            reward = 0
            done = True
            s_ = 'terminal'
            # print('time is over.')
        else:
            reward = self.data.R.iloc[nix]
            done = False
            s_ = times[nix]
            if action == 'B':
                pass
            elif action == 'S':
                # when R is negative, choosing S should earn a positive reward
                reward = reward * -1
            else:
                # choosing to wait loses nothing and earns nothing, but carries an
                # opportunity cost; here we stay neutral about waiting and set
                # reward = 0, which may need adjusting in different market regimes
                reward = 0
        self.time = s_
        return s_, reward, done

    def reset(self):
        self.time = 935
        return self.time
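Before moving on to the agent, a quick sanity check of the environment might look like the sketch below; day_bars and the add_forward_return helper from the earlier snippet are illustrative assumptions rather than part of the original demo.

env = Market(add_forward_return(day_bars))
state = env.reset()                 # 935
s_, reward, done = env.step('B')    # reward is the forward return read at the next bar
print(state, '->', s_, reward, done)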
Then create the Q-learning algorithm (we can also call this class the Agent):

import numpy as np
import pandas as pd


class QLearning:
    # Agent

    def __init__(self, actions, q_table=None, learning_rate=0.01,
                 discount_factor=0.9, e_greedy=0.1):
        self.actions = actions          # list of actions
        self.lr = learning_rate         # learning rate
        self.gamma = discount_factor    # discount factor
        self.epsilon = e_greedy         # degree of greed: probability of a random (exploratory) action
        # the columns are the actions
        if q_table is None:
            self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float32)  # Q table
        else:
            self.q_table = q_table

    # check whether this state already exists in q_table;
    # if the current state is missing, insert a row of zeros as the initial
    # value of every action in that state
    def check_state_exist(self, state):
        # each state corresponds to one row; if it is not yet in the Q table,
        if state not in self.q_table.index:
            # insert a row of zeros, assigning 0 to every action
            self.q_table = self.q_table.append(
                pd.Series(
                    [0] * len(self.actions),
                    index=self.q_table.columns,
                    name=state,
                )
            )

    # choose an action according to the state
    def choose_action(self, state):
        self.check_state_exist(state)  # make sure this state exists in q_table
        # choose the behaviour with the epsilon-greedy method
        if np.random.uniform() < self.epsilon:
            # random action
            action = np.random.choice(self.actions)
        else:  # choose the action with the highest Q value
            state_action = self.q_table.loc[state, :]
            # in the same state several actions may share the same Q value, so shuffle first
            state_action = state_action.reindex(np.random.permutation(state_action.index))
            # then take the action with the largest Q value in this row
            action = state_action.idxmax()
        return action

    # learn: update the Q values in the table
    def learn(self, s, a, r, s_):
        # s_ is the next state
        self.check_state_exist(s_)  # make sure s_ exists in q_table

        # Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]

        q_predict = self.q_table.loc[s, a]  # the estimated value (predict) read from the Q table

        # q_target is the "real" value
        if s_ != 'terminal':  # next state is not the terminator
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()
        else:
            q_target = r  # next state is the terminator

        # update the state-action value in the Q table
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)
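As a quick illustration of how the table gets filled in, a single choose/learn round trip looks like the sketch below. The reward of 0.002 is made up for the example, and it assumes a pandas version that still provides DataFrame.append, which the class above relies on.

agent = QLearning(actions=['B', 'S', 'W'])
a = agent.choose_action(935)        # 935 gets a fresh all-zero row, then an action comes back
agent.learn(935, 'B', 0.002, 940)   # one Q-learning update for the transition 935 -> 940
print(agent.q_table)                # the B entry for 935 should now be about lr * r = 2e-05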
The last step is to create a file that coordinates the two classes above and starts the work:

def update(data, q_table=None):
    env = Market(data)
    RL = QLearning(actions=env.action_space, q_table=q_table)

    for episode in range(100):
        # initialise the state
        state = env.reset()

        step_count = 0  # record the number of steps taken

        while True:
            # refresh the visualised environment
            # env.render()
            # the RL brain chooses an action according to the state
            action = RL.choose_action(state)
            # the environment executes the action and returns the next state,
            # the reward, and whether the day is over
            state_, reward, done = env.step(action)
            # the RL brain learns from this transition
            RL.learn(state, action, reward, state_)
            state = state_
            step_count += 1
            # stop the episode once the last K line of the day has been reached
            if done:
                break

    return RL.q_table
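To see the whole thing run end to end, here is a minimal smoke test on synthetic bars. The random-walk prices, the add_forward_return helper from earlier and the 'close' column name are all illustrative assumptions, not part of the original demo; with real JQData 5-minute bars you would instead pass in one R-annotated trading day after another and keep feeding the returned q_table back in.

import numpy as np
import pandas as pd

# 48 synthetic five-minute closes for one day (random walk, demo only)
np.random.seed(0)
closes = 10 + np.cumsum(np.random.normal(0, 0.02, len(times)))
day_bars = pd.DataFrame({'close': closes}, index=times)

q_table = update(add_forward_return(day_bars))
print(q_table.idxmax(axis=1))   # the preferred action at each time of day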