Ariel University
Course number: 7061510-1
Lecturer: Dr. Amos Azaria
Edited by: Moshe Hanukoglu
Date: First Semester 2018-2019
Until now, we have dealt with the structure of the learning network, but not with the learning process itself. When a person wants to teach his dog to perform an action, he gives the dog positive feedback when it does the right action at the right time; otherwise he gives the dog negative feedback.
This method is called the carrot and stick method.
When training agents we will do the same thing: when the agent performs a good action, it is given positive feedback; otherwise it is given negative feedback.
An important thing to emphasize: the reward depends not only on the action, but also on the state of the environment!
One common setting for training agents by feedback is games in which the agent plays alone.
We will show an environment that lets you connect very simple games to Python code. In this environment you can build an agent and train it on the game.
First, you need to install the package by typing the following command in bash
pip install gym
We'll see an example of a very simple game called FrozenLake-v0
The rules of the game are:
You start at the letter "S".
You have to reach the letter "G".
At any stage you can move within the field to the right, left, up, or down.
When you step on the letter "F" (frozen ice), your move is random (it may differ from the direction you intended).
When you reach the letter "H" (a hole), you lose and cannot move anymore.
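For reference, the default FrozenLake-v0 board printed by env.render() should look like the 4x4 grid below (the player's current position is highlighted when rendering):
SFFF
FHFH
FFFH
HFFG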
The steps are:
import gym
import sys

env = gym.make('FrozenLake-v0')
env.reset()
env.render()

move = input()
while move.isdigit():
    dig_move = int(move)
    if dig_move < env.action_space.n:
        env.step(dig_move)
        env.render()
    move = input()
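A note on the input: the digits you type are the environment's action indices. To the best of our recollection of the gym implementation, the mapping for FrozenLake-v0 is the one below, though on frozen tiles the actual move may still differ from the requested one (as explained above):
# FrozenLake-v0 action indices (as we recall them from the gym implementation)
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}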
In order to analyze the game and be able to teach the agent to play it, we can use a mathematical model that simulates the game.
The MDP (Markov Decision Process) model describes a discrete-time process that is partly random and partly under the control of a decision maker.
The model contains four parts (S, A, R, T): S is the set of states, A is the set of actions, R is the reward function, and T is the transition function, which gives the probability of reaching each next state given the current state and an action.
To expand on the subject, see the link.
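As a toy illustration (a made-up two-state example, not the FrozenLake model itself), an MDP can be written out explicitly in Python:
# A tiny hand-made MDP with two states and two actions (illustrative only)
S = ["s0", "s1"]                 # set of states
A = ["stay", "go"]               # set of actions
# R[(s, a)]: reward for taking action a in state s
R = {("s0", "stay"): 0, ("s0", "go"): 1,
     ("s1", "stay"): 0, ("s1", "go"): -1}
# T[(s, a)]: probability distribution over the next state
T = {("s0", "stay"): {"s0": 1.0},
     ("s0", "go"):   {"s0": 0.1, "s1": 0.9},
     ("s1", "stay"): {"s1": 1.0},
     ("s1", "go"):   {"s0": 1.0}}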
We will focus on model-free MDPs, i.e., we know nothing about the transition function or the reward function, but we do know all the states and the possible actions.
We will define some concepts needed to understand what follows:
There are two approaches to selecting the action in each state while learning the Q-values: Exploration (trying random actions in order to discover new states) and Exploitation (choosing the action with the highest known Q-value).
We will combine the two approaches: we perform a few random actions and many actions based on what we have already learned. This way we both discover new states and learn to act correctly and progress in the right direction. Using only one of the approaches is a mistake: learning only by random actions reaches deeper states of the world only rarely (it is very hard to finish a game by acting randomly), and learning only from previous information never introduces new moves. Therefore we set a small value $\epsilon$, draw a random number, and perform a random action only if the drawn number is smaller than $\epsilon$; otherwise we perform the informed action. That is, with probability $\epsilon$ we execute a random action and with probability $1 - \epsilon$ we perform an action based on previous information.
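The code below implements the standard Q-learning update rule, which moves the current estimate toward the observed reward plus the discounted value of the best action in the next state ($\alpha$ is the learning rate and $\gamma$ is the discount factor):
$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s_n, a') - Q(s, a) \right)$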
import gym
import sys
import random
import numpy as np

epsilon = 0.1
gamma = 0.95
alpha = 0.1

env = gym.make('FrozenLake-v0')
env.reset()
env.render()

# Q is a table of states and all actions per state
Q = np.zeros([env.observation_space.n, env.action_space.n])
for i in range(10000):  # number of games
    s = env.reset()
    done = False
    while not done:  # until game over, or max number of steps
        if random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s, :])
        # Values returned: the next state, the reward, and whether the game is over
        s_n, r, done, _ = env.step(a)
        Q[s, a] = Q[s, a] + alpha * (r + gamma * np.max(Q[s_n, :]) - Q[s, a])
        s = s_n
The rows that remain all zeros represent the states where the player is on H or G and cannot continue playing.
Q
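After training we can check the learned table by playing one episode greedily, with no exploration. This is only a sketch that reuses the env and Q defined above:
# Play one episode greedily with the learned Q table (uses env and Q from above)
s = env.reset()
done = False
total_reward = 0
while not done:
    a = np.argmax(Q[s, :])  # always take the best-known action
    s, r, done, _ = env.step(a)
    total_reward += r
env.render()
print("total reward:", total_reward)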
The method presented above, which builds a table of all states and all actions for each state, works well when the number of states is small, but what happens when there are many states?
The answer is that we try to predict the Q-values with a network that receives a state and returns a Q-value for each action, and we select the action with the maximum Q-value. This way we do not have to store all the values; we predict them whenever we need them.
But where do we get data to train this network?
We perform an action (the one that maximizes the Q-value), obtain the reward (r) and the new state ($s_n$), and then compute the Q-values of the new state (using the same network) for each of the actions.
Now the "real" Q-value, for the previous state-action, is: $r + γ * max_{a'}Q(s_n,\ a')$$.
import gym
import time
import random
import numpy as np
import tensorflow as tf

epsilon = 0.1
gamma = 0.999
num_of_games = 10

env = gym.make('Pong-v0')

# height: 210, width: 160; since the image is RGB we get another 3 channels
state = tf.placeholder(tf.float32, shape=[1, 210, 160, 3])
state_vec = tf.reshape(state, [1, 210*160*3])

# env.action_space.n returns the number of actions per state
W1 = tf.Variable(tf.truncated_normal([210*160*3, env.action_space.n], stddev=1e-5))
b1 = tf.Variable(tf.constant(0.0, shape=[env.action_space.n]))
Q4actions = tf.matmul(state_vec, W1) + b1

# Q_n will hold the target Q-values (one entry per action)
Q_n = tf.placeholder(tf.float32, shape=[1, env.action_space.n])
loss = tf.pow(Q4actions - Q_n, 2)
update = tf.train.GradientDescentOptimizer(1e-10).minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(num_of_games):
    env.reset()
    s, _, done, _ = env.step(0)  # first move doesn't matter anyway, just get the state
    while not done:
        all_Qs = sess.run(Q4actions, feed_dict={state: [s]})
        if random.random() < epsilon:
            next_action = env.action_space.sample()
        else:
            next_action = np.argmax(all_Qs)
        s_n, r, done, _ = env.step(next_action)
        Q_corrected = np.copy(all_Qs)
        next_Q = sess.run(Q4actions, feed_dict={state: [s_n]})
        Q_corrected[0][next_action] = r + gamma * np.max(next_Q)
        sess.run(update, feed_dict={state: [s], Q_n: Q_corrected})
        s = s_n  # move to the next state
When we use only linear regression we do not get good results; therefore, we will use fully connected layers and a CNN.
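As a rough sketch of what such a network could look like (the layer sizes below are our own illustrative assumptions, written in the same TensorFlow 1.x style as the code above, not a tuned architecture):
import tensorflow as tf

num_actions = 6  # assumption: env.action_space.n for Pong-v0

# Same input as before: a single RGB frame
state = tf.placeholder(tf.float32, shape=[1, 210, 160, 3])

# Two convolutional layers followed by a fully connected layer (illustrative sizes)
conv1 = tf.layers.conv2d(state, filters=16, kernel_size=8, strides=4, activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, filters=32, kernel_size=4, strides=2, activation=tf.nn.relu)
flat = tf.layers.flatten(conv2)
hidden = tf.layers.dense(flat, units=256, activation=tf.nn.relu)

# One Q-value per action, exactly as in the linear version
Q4actions = tf.layers.dense(hidden, units=num_actions)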
We can improve the results by using:
Until now we have used an algorithm that computes the Q-value based on a given game policy: we estimate the value of state s under each action (based on our game policy) and choose the action that returns the maximum Q-value. In this method, the algorithm tries to find the most suitable Q-value for each state.
Policy-based algorithms learn the game policy $\pi_\theta$ itself: in each iteration the algorithm re-defines the policy and computes the value according to this new policy, until the policy converges.
A policy-based algorithm wants to know whether a policy is better than another policy, and therefore it compares the expected value of this policy with those of previous policies; but in order to calculate the expected value, the algorithm has to know the value of each action that it takes.
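One standard way to write the expected value of a policy (this is the usual definition, not something specific to this course) is the expected discounted sum of rewards obtained when acting according to $\pi_\theta$:
$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \gamma^{t} r_t\right]$
Comparing two policies then means comparing their $J(\theta)$ values.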
There are two solutions to this problem: