Value Iteration

RL problems are modeled as Markov Decision Processes (MDPs).

Notable deep RL success stories include playing Atari games (2013) and the Go player AlphaGo (2016).

Some deep RL success stories

There are multiple ways of solving RL problems. Here we first look at policy iteration and value iteration, which are based on dynamic programming.

To formulate the problem, an MDP is defined by the tuple (S, A, P, R, γ, H): a set of states S, a set of actions A, a transition function P(s'|s, a), a reward function R(s, a, s'), a discount factor γ, and a horizon H.

As an example, consider the Gridworld shown below. The agent can take actions to move north, east, west, or south. If the agent reaches the blue diamond, it receives a reward of +1. If it falls into the orange square, it receives a reward of -1. Reaching anywhere else in the maze yields zero reward.

Gridworld example

The goal is to find the optimal policy to maximize the expected sum of the rewards under that policy.

A policy π determines what action to take for a given state. It could be a distribution over actions or a deterministic function. As an example, a deterministic policy π for the Gridworld is shown below.

An example of policy π to take actions on the Gridworld.

The problem of optimal control, or planning, is: given an MDP (S, A, P, R, γ, H), find the optimal policy π*. Two exact methods to solve this problem are value iteration and policy iteration.
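For concreteness, below is a minimal Python sketch of how such a tuple could be encoded for the 4x3 Gridworld. The coordinate layout, the terminal handling, and the helper names are illustrative assumptions, not code from the lecture or lab.

# Minimal sketch of the MDP tuple (S, A, P, R, gamma, H) for a 4x3 Gridworld.
# The layout, terminal handling, and names are assumptions for illustration.
GRID_W, GRID_H = 4, 3
S = [(x, y) for x in range(1, GRID_W + 1) for y in range(1, GRID_H + 1)]
A = ["north", "south", "east", "west"]
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}  # blue diamond and orange square
gamma, H = 0.9, 100

def move(s, a):
    # Deterministic move; stay in place if the target cell is off the grid.
    if s in TERMINALS:  # terminal cells are absorbing
        return s
    dx, dy = MOVES[a]
    nxt = (s[0] + dx, s[1] + dy)
    return nxt if nxt in S else s

# Transition model P[s][a] = list of (probability, next state) pairs; deterministic
# here, so each list has a single entry. Noise (e.g. 0.8/0.1/0.1) can be added later.
P = {s: {a: [(1.0, move(s, a))] for a in A} for s in S}

def R(s, a, s_next):
    # Reward for entering a terminal cell, zero everywhere else.
    return 0.0 if s in TERMINALS else TERMINALS.get(s_next, 0.0)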

In value iteration, the optimal value function V* is defined as

V*(s) = max over policies π of E[ Σ_{t=0..H} γ^t R(s_t, a_t, s_{t+1}) | s_0 = s, π ]

which is the expected sum of discounted rewards when starting at state s and acting optimally.

For example, the optimal value function of the Gridworld with a deterministic transition function (actions always succeed), gamma=1, and H=100 is calculated as below:

V*(4,3) = 1

V*(3,3) = 1

V*(2,3) = 1

V*(1,1) = 1

V*(4,2) = -1

In another example, the optimal value function of the Gridworld when actions always succeed, with gamma=0.9 and H=100, is calculated as:

V*(4,3) = 1

V*(3,3) = 0.9 # because of discount factor gamma=0.9

V*(2,3) = 0.9*0.9 = 0.81

V*(1,1) = 0.9*0.9*0.9*0.9*0.9 = 0.59

V*(4,2) = -1
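These values are just powers of the discount factor; a quick check in Python (the step counts along the shortest paths are read off the grid):

gamma = 0.9
print(gamma ** 1)             # V*(3,3): one step from the +1 square  -> 0.9
print(gamma ** 2)             # V*(2,3): two steps away               -> 0.81
print(round(gamma ** 5, 2))   # V*(1,1): five steps away              -> 0.59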

In another example, actions succeed with probability 0.8; with probability 0.1 the agent stays in the same place, and with probability 0.1 it moves to a neighboring state. With gamma = 0.9 and H = 100, the optimal values are calculated as:

V*(4,3) = 1

V*(3,3) = 0.8 * 0.9 + 0.1 * 0.9 * V*(3,3) + 0.1 * 0.9 * V*(3,2)

The remaining values V*(2,3), V*(1,1), and V*(4,2) are written out in the same way.

As you can see, in the case of a stochastic transition function, the optimal value of a state depends on the values of other states. In other words, it requires a recursive/iterative calculation. That is where value iteration comes in!

The value iteration algorithm starts from V_0(s) = 0 for all states and repeatedly applies the Bellman backup

V_{k+1}(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V_k(s') ]

Here, V_k(s) is the optimal value of state s when k time steps remain, P(s'|s, a) is the transition probability, and γ is the discount factor.

The optimal values for the Gridworld with H=100, discount=0.9, and noise=0.2 are calculated and shown below. Note that after a certain number of iterations, the value function stops changing significantly.

Value iteration is guaranteed to converge. At convergence, the optimal value function is found and as a result, the optimal policy is found.
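As a rough sketch (not the lab code), tabular value iteration on a finite MDP could look like the following, using the P[s][a] = list of (probability, next state) layout from the Gridworld sketch above. The convergence threshold and the tiny toy MDP at the end are assumptions for illustration.

def value_iteration(S, A, P, R, gamma=0.9, H=100, tol=1e-6):
    # Repeatedly apply the Bellman backup, starting from V_0(s) = 0.
    V = {s: 0.0 for s in S}
    for _ in range(H):
        V_new = {}
        for s in S:
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in P[s][a])
                for a in A
            )
        if max(abs(V_new[s] - V[s]) for s in S) < tol:  # values stopped changing
            return V_new
        V = V_new
    return V

# Tiny two-state toy MDP just to exercise the function (not the Gridworld).
S = ["s0", "s1"]
A = ["stay", "go"]
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s1")]},
}
R = lambda s, a, s2: 1.0 if (s == "s0" and s2 == "s1") else 0.0
print(value_iteration(S, A, P, R))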

We now consider another family of methods based on Q-values (this leads to Q-learning later on). To this end, the optimal Q-value function is defined as

Q*(s, a) = max over policies π of E[ Σ_{t=0..H} γ^t R(s_t, a_t, s_{t+1}) | s_0 = s, a_0 = a, π ]

Optimal Q-value function at state s taking action a: the expected sum of discounted rewards when starting at s, taking action a, and acting optimally afterwards.

Q-values are similar to V-values except that, in addition to the state s, the action a is also given to the function. Similarly, there is a Bellman equation for Q-values:

Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]

Bellman equation for the optimal Q-value function

To solve for Q*, Q-value iteration turns this equation into an update:

Q_{k+1}(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]

There are multiple benefits of using Q-values instead of state values, which will be discussed later. For now, it is worth noting that the Q-function implicitly encodes the optimal policy: the greedy action argmax_a Q*(s, a) can be read off directly, whereas with value iteration an extra step (using the transition model) is needed to extract the policy from V*.

As an example, the Q-values for the Gridworld with gamma=0.9 and noise=0.2 after 100 iterations are shown below. There are four Q-values per state since there are four actions to take.
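A sketch of Q-value iteration in the same spirit, assuming the same list-of-(probability, next state) transition layout as in the earlier sketches. Note how the greedy policy is read off the Q-table with an argmax, without keeping a separate value or policy table.

def q_value_iteration(S, A, P, R, gamma=0.9, H=100):
    # Q_{k+1}(s,a) = sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * max_{a'} Q_k(s',a') ]
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(H):
        Q = {
            (s, a): sum(p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in A))
                        for p, s2 in P[s][a])
            for s in S for a in A
        }
    return Q

def greedy_policy(Q, S, A):
    # The optimal policy is implicit in Q*: pick the argmax action in each state.
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}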

Finally, we look at policy evaluation and policy iteration. In policy evaluation, we fix the policy π and compute its value function iteratively:

V^π_{k+1}(s) = Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ V^π_k(s') ]

As seen in the above equation, the max operation is gone: since the policy is now fixed, there is only one action to take in each state, namely π(s).

Policy iteration then alternates two steps, as shown below: (1) policy evaluation for the current policy π_k, and (2) policy improvement, which sets π_{k+1}(s) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V^{π_k}(s') ]. We repeat until the policy stops changing; under some conditions this converges faster than value iteration.

Policy iteration
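A compact sketch of both steps, under the same assumed data layout as the earlier sketches. The evaluation step is itself iterative here; one could also solve the resulting linear system directly.

def policy_evaluation(pi, S, P, R, gamma=0.9, sweeps=100):
    # V^pi(s) = sum_{s'} P(s'|s,pi(s)) [ R(s,pi(s),s') + gamma * V^pi(s') ]; no max.
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        V = {s: sum(p * (R(s, pi[s], s2) + gamma * V[s2]) for p, s2 in P[s][pi[s]])
             for s in S}
    return V

def policy_iteration(S, A, P, R, gamma=0.9):
    # Alternate evaluation and greedy improvement until the policy stops changing.
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        V = policy_evaluation(pi, S, P, R, gamma)
        new_pi = {
            s: max(A, key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                        for p, s2 in P[s][a]))
            for s in S
        }
        if new_pi == pi:  # policy converged
            return pi, V
        pi = new_pi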

Note: Lab 1 includes examples for value iteration and policy iteration.

In this first lecture, the basics of RL and MDPs were introduced, along with exact methods for solving small MDP problems: value iteration, Q-value iteration, and policy iteration. The limitation of these methods is that they must iterate over, and store values for, all states and actions, so they are only suitable for small, discrete state-action spaces. Moreover, their update equations require access to the dynamics of the environment, i.e., the transition function P(s'|s, a).
