Reinforcement Learning
Model-Based Reinforcement Learning
Keywords: model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning
Vien Ngo, Marc Toussaint
University of Stuttgart
Outline
• Model-Based RL
• Exploration/Exploitation
• PAC-MDP
• E3, RMAX
• Bayes optimal: Bayesian model-based RL
RL Approaches
Limitations of the model-free view
• given learnt values, behavior is a fixed stimulus-response (or state-action) mapping
• if the "goal" changes: values/policies need to be re-learned for every state in the world! all previous values are obsolete
• no general "knowledge", only values
• no anticipation of outcomes (s'), only of value
• no "planning"
Goal-directed Decision Making
Wolfgang Köhler (1917): Intelligenzprüfungen an Menschenaffen (The Mentality of Apes)
[movie]
Goal-directed vs. habitual: Devaluation
Niv, Joel & Dayan: A normative perspective on motivation. TICS, 10:375-381, 2006.
By definition, goal-directed behavior is performed to obtain a desired goal. Although all instrumental behavior is instrumental in achieving its contingent goals, it is not necessarily purposively goal-directed. Dickinson and Balleine [1,11] proposed that behavior is goal-directed if: (i) it is sensitive to the contingency between action and outcome, and (ii) the outcome is desired. Based on the second condition, motivational manipulations have been used to distinguish between two systems of action control: if an instrumental outcome is no longer a valued goal (for instance, food for a sated animal) and the behavior persists, it must not be goal-directed. Indeed, after moderate amounts of training, outcome revaluation brings about an appropriate change in instrumental actions (e.g. lever-pressing) [43,44], but this is no longer the case for extensively trained responses ([30,31], but see [45]). That extensive training can render an instrumental action independent of the value of its consequent outcome has been regarded as the experimental parallel of the folk psychology maxim that well-performed actions become habitual [9] (see Figure I).
Niv, Joel & Dayan: A normative perspective on motivation. TICS, 10:375-381, 2006.
Model-based RL
• Model learning: Given data D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^H, estimate P(s'|a,s) and P(r|a,s)
  – discrete state-action: P(s'|a,s) = #(s',a,s) / #(a,s)
  – continuous state-action: P(s'|a,s) = N(s' | φ(s,a)^T β, Σ)
    estimate parameters β (and perhaps Σ) as for regression (including non-linear features, regularization, cross-validation!)
• Planning:
  – discrete state-action: Value Iteration with the estimated model (see the sketch below)
  – continuous state-action: Least Squares Value Iteration, Differential Dynamic Programming, Planning-by-Inference
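As an illustration of the discrete case, here is a minimal Python sketch of count-based model estimation followed by Value Iteration on the estimated model (function and variable names are hypothetical, not from the original slides):

import numpy as np

def estimate_model(data, n_states, n_actions):
    """Count-based estimates of P(s'|s,a) and E[r|s,a] from (s,a,r,s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)
    # unvisited (s,a) pairs fall back to a uniform next-state distribution
    P = np.divide(counts, n_sa, out=np.full_like(counts, 1.0 / n_states), where=n_sa > 0)
    R = np.divide(rew_sum, n_sa[:, :, 0], out=np.zeros_like(rew_sum), where=n_sa[:, :, 0] > 0)
    return P, R

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    """Plan in the estimated MDP; returns optimal values and the greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V          # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new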
Exploration: Motivation
Start: many potential paths
Continue using this path? Try out alternatives?
Reinforcement learning
• In reinforcement learning (RL), the agent starts to act without a model of the environment.
• The agent has to learn from its experience what to do in order to fulfill tasks and achieve high rewards.
• RL algorithms we have seen thus far: Q-learning, TD-learning.
• Note the difference to the problem of adapting the behavior based on a given model (also called planning / solving an MDP / calculating optimal state and action values). This is a computational subproblem in model-based RL.
• Planning algorithms we have seen thus far: value iteration, policy iteration.
Exploration / Exploitation
• In contrast to supervised learning, in RL the data used for learning depend on the agent.
• Two different types of behavior:
  – exploration: behave with the goal of learning as much as possible
  – exploitation: behave with the goal of getting as much reward as possible
• Challenge in exploration: which actions will lead to the most important learning progress with respect to the goal?
  → exploration as fundamental intelligent behavior
Recall Markov Decision Processes
[graphical model with nodes a_0, s_0, r_0, a_1, s_1, r_1, a_2, s_2, r_2, ...]

P(s_{0:T}, a_{0:T}, r_{0:T}; π) = P(s_0) P(a_0|s_0; π) P(r_0|a_0, s_0) ∏_{t=1}^T P(s_t|a_{t-1}, s_{t-1}) P(a_t|s_t; π) P(r_t|a_t, s_t)

– world's initial state distribution P(s_0)
– world's transition probabilities P(s_{t+1} | a_t, s_t)
– world's reward probabilities P(r_t | a_t, s_t)
– agent's policy π(a_t | s_t) (or deterministic a_t = π(s_t))
– discount parameter γ for future rewards
– two different sources of uncertainty: the world itself (not controlled by the agent) vs. the policy (controlled by the agent)
Exploration-exploitation tradeoff
• Goal of the reinforcement learning agent: maximize future rewards E[∑_{t=0}^∞ γ^t r_t | s_0; π]
• However, the agent does not know the transition parameters P(s_{t+1} | a_t, s_t) and reward parameters P(r_t | a_t, s_t) of the MDP.
• Rather, the agent needs to learn from its experience s_0, a_0, r_0, s_1, a_1, r_1, ... which actions will lead to high rewards.
• Exploration-exploitation tradeoff: Which policy π(a_t | s_t) for action selection shall the agent follow so that it does not miss the high-reward states, but does not spend too much time in low-reward states, either?
  – Exploitation: Prefer actions a_t which have led to reward before?
  – Exploration: Or rather take actions to learn more about the unknown MDP parameters and potentially find states with higher reward?
Sample Complexity
• Let M be an MDP with N states, K actions, discount factor γ ∈ [0, 1), and a maximal reward Rmax > 0.
• Let A be an algorithm (that is, a reinforcement learning agent) that acts in the environment, resulting in s_0, a_0, r_0, s_1, a_1, r_1, ...
• Let V^A_{t,M} = E[∑_{s=0}^∞ γ^s r_{t+s} | s_0, a_0, r_0, ..., s_{t-1}, a_{t-1}, r_{t-1}, s_t].
• V* is the value function of the optimal policy.
• Definition: Let ε > 0 be a prescribed accuracy and δ > 0 an allowed probability of failure. The expression η(ε, δ, N, K, γ, Rmax) is a sample complexity bound for algorithm A if, independently of the choice of s_0, with probability at least 1 − δ, the number of timesteps such that V^A_{t,M} < V*(s_t) − ε is at most η(ε, δ, N, K, γ, Rmax). (Kakade, 2003)
Efficient exploration
• An algorithm with sample complexity polynomial in 1/ε, log(1/δ), N, K, 1/(1 − γ), and Rmax is called PAC-MDP (probably approximately correct in MDPs).
Exploration strategies
• The exploration strategy is reflected in the policy π(s).
• In the following, assume we have estimates Q(s, a).
• greedy (only exploit): π(s) = argmax_a Q(s, a)
  – problem: the learned model is not the same as the true environment
  – without exploration, the agent is likely to miss high rewards
• random: choose action a with probability 1/|A| (uniform over actions)
  – problem: ignores value estimates and thus rewards
Exploration strategies (continued)
• ε-greedy:
  π(s) = argmax_a Q(s, a)   with probability 1 − ε
         random action      with probability ε
  – most popular method
  – converges to the optimal value function with probability 1 (all paths will be visited sooner or later), if the exploration rate diminishes according to an appropriate schedule
  – problem: sample complexity exponential in the number of states
• Boltzmann: choose action a with probability exp(Q(s,a)/T) / ∑_{a'} exp(Q(s,a')/T) (see the code sketch below)
  – temperature T controls the amount of exploration
  – problem again: sample complexity exponential in the number of states
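Both strategies can be sketched in a few lines of Python, assuming a tabular value estimate Q of shape (number of states, number of actions); this is an illustrative sketch, not code from the slides:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    # exploit argmax_a Q(s,a) with probability 1 - epsilon, explore otherwise
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, temperature=1.0):
    # sample an action with probability proportional to exp(Q(s,a)/T)
    logits = Q[s] / temperature
    p = np.exp(logits - logits.max())   # shift for numerical stability
    p /= p.sum()
    return int(rng.choice(Q.shape[1], p=p))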
Exploration strategies (continued)
• Other heuristics for exploration:
  – minimize variance of action value estimates
  – optimistic initial values ("optimism in the face of uncertainty")
  – state bonuses: frequency, recency, error, etc.
  → problem again: sample complexity exponential in the number of states
• Bayesian RL: optimal exploration strategy
  – distribution over MDP models (i.e., the parameters of the MDP)
  – posterior distribution updated after each new observation
  – exploration strategy minimizes uncertainty of parameters
  – Bayes-optimal solution to the exploration-exploitation tradeoff (i.e., no other policy is better in expectation w.r.t. the prior distribution over MDPs)
  – only tractable for very simple problems
• E3 and Rmax: principled approaches to the exploration-exploitation tradeoff with polynomial sample complexity
Explicit Explore-or-Exploit (E3) algorithm
Kearns and Singh (2002)
• Model-based approach with polynomial sample complexity (PAC-MDP)
  – uses optimism in the face of uncertainty
  – assumes knowledge of the maximum reward
• Maintains counts for states and actions to quantify confidence in model estimates
  – A state s is known if all actions in s have been executed sufficiently often.
• From observed data, E3 constructs two MDPs:
  – MDP_known: includes known states with (approximately exact) estimates of P(s_{t+1} | a_t, s_t) and P(r_t | a_t, s_t)
    → model which captures what you know
  – MDP_unknown: MDP_known + a special state s' where the agent receives maximum reward
    → model which drives exploration
E3 sketch
Input: state s
Output: action a

if s is known then
    Plan in MDP_known                         ▷ sufficiently accurate model estimates
    if resulting plan has value above some threshold then
        return first action of plan           ▷ exploitation
    else
        Plan in MDP_unknown
        return first action of plan           ▷ planned exploration
    end if
else
    return action with the least observations in s    ▷ direct exploration
end if
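The same decision rule as a schematic Python function; plan_known, plan_unknown, counts, and value_threshold are hypothetical placeholders for planners on MDP_known/MDP_unknown, the visit counters, and the exploitation threshold:

def e3_action(s, known_states, plan_known, plan_unknown, counts, value_threshold):
    """One E3 decision step: exploit, planned exploration, or direct exploration."""
    if s in known_states:
        value, action = plan_known(s)        # plan in MDP_known
        if value >= value_threshold:
            return action                    # exploitation
        _, action = plan_unknown(s)          # plan in MDP_unknown
        return action                        # planned exploration
    # direct exploration: take the action with the fewest observations in s
    return min(counts[s], key=counts[s].get)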
E3 example
[figures: M = true known-state MDP, M̂ = estimated known-state MDP]
S. Singh (Tutorial 2005)
E3 Implementation Setting
• T is the time horizon.
• G^T_max is the maximum T-step return (in the discounted case, G^T_max ≤ T·Rmax).
• A state is known if it was visited O((N T G^T_max / ε)^4 Var_max log(1/δ)) times (Var_max is the maximum variance of the random payoffs over all states).
• For the exploration/exploitation choice at known states, the optimal value function V* is assumed to be given: if the value obtained from MDP_known exceeds V* − ε, then exploit.
RMAX
• R-MAX maintains a single model (it does not separate MDP_known and MDP_unknown) and therefore implicitly explores or exploits.
• The R-MAX and E3 algorithms achieve roughly the same level of performance (Strehl's thesis).
• RMAX builds an approximate MDP based on the reward function
  R̃(s, a) = R(s, a)   if (s, a) is known
             Rmax      otherwise
RMAX sketch
Initialize all counters n(s, a) = 0, n(s, a, s') = 0.
Initialize T(s'|s, a) = I[s = s'], R(s, a) = Rmax.
while true do
    Compute policy π_t using the MDP model (T, R).
    Choose a = π_t(s), observe s', r.
    n(s, a) ← n(s, a) + 1
    r(s, a) ← r(s, a) + r
    n(s, a, s') ← n(s, a, s') + 1
    if n(s, a) = m then
        Update T(·|s, a) = n(s, a, ·)/n(s, a) and R(s, a) = r(s, a)/n(s, a).
    end if
end while
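A compact tabular version of this loop in Python, a sketch under the stated optimistic initialization (the class and its API are hypothetical):

import numpy as np

class RMax:
    def __init__(self, n_states, n_actions, r_max, m, gamma=0.95):
        self.m, self.gamma = m, gamma
        self.n_sa = np.zeros((n_states, n_actions))             # n(s,a)
        self.n_sas = np.zeros((n_states, n_actions, n_states))  # n(s,a,s')
        self.r_sum = np.zeros((n_states, n_actions))            # r(s,a)
        # optimistic initialization: T(s'|s,a) = I[s=s'], R(s,a) = Rmax
        self.T = np.stack([np.eye(n_states)] * n_actions, axis=1)
        self.R = np.full((n_states, n_actions), r_max)

    def update(self, s, a, r, s_next):
        self.n_sa[s, a] += 1
        self.r_sum[s, a] += r
        self.n_sas[s, a, s_next] += 1
        if self.n_sa[s, a] == self.m:        # (s,a) just became "known"
            self.T[s, a] = self.n_sas[s, a] / self.m
            self.R[s, a] = self.r_sum[s, a] / self.m

    def policy(self, n_iters=200):
        # value iteration on the current (optimistic) model
        V = np.zeros(self.R.shape[0])
        for _ in range(n_iters):
            Q = self.R + self.gamma * self.T @ V
            V = Q.max(axis=1)
        return Q.argmax(axis=1)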
RMAX Analysis
• The general PAC-MDP theorem does not easily adapt to the analysis of E3 because of E3's use of two internal models (the original analysis depends on the horizon and the mixing time).
• (Upper bound) There exists m = O((N T² / ε²) ln²(N A / δ)) such that, with probability at least 1 − δ, V(s_t) ≥ V*(s_t) − ε holds for all but
  O((N² A T³ / ε³) ln²(N A / δ))
  timesteps, where N is the number of states and A the number of actions.
• For the discounted case: T = log(1/ε) / (1 − γ).
Limitations of E3 and RMAX
• E3/RMAX is called "efficient" because its sample complexity scales only polynomially in the number of states.
• In natural environments, however, this number of states is enormous: it is exponential in the number of objects in the environment.
• Hence E3/RMAX scales exponentially in the number of objects.
• Generalization over states and actions is crucial for exploration.
KWIK
KWIK (Knows What It Knows): a supervised-learning model
• Input set: X
• Output set: Y
• Observation set: Z
• Hypothesis class: H of functions X → Y
• Target function: h* ∈ H
• Special symbol: ? ("I don't know")
KWIK Example: Coin Learning
• Predict Pr(head) ∈ [0, 1] for a coin
  – from many observations: head or tail
• Algorithm (sketched in code below)
  – Predict ? the first O(1/ε² log(1/δ)) times
  – Use the empirical estimate afterwards
  – The bound O(1/ε² log(1/δ)) follows from Hoeffding's bound
Li et al., ICML 2008.
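A minimal Python sketch of this coin learner, with the constant in the Hoeffding bound made explicit (the class name is hypothetical):

import math

class KwikCoin:
    def __init__(self, epsilon=0.05, delta=0.05):
        # Hoeffding: m >= ln(2/delta) / (2 epsilon^2) samples give an
        # epsilon-accurate empirical mean with probability at least 1 - delta
        self.m = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
        self.heads = 0
        self.n = 0

    def predict(self):
        # "?" (here: None) until enough observations have been collected
        return self.heads / self.n if self.n >= self.m else None

    def observe(self, is_head):
        self.n += 1
        self.heads += int(is_head)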
KWIK-RMAX
• T(s'|s, a) and R(s, a) are KWIK-learned.
Bayesian Model-based RL
Encode the unknown probabilities T(s, a, s') with random variables θ
• θ_{sas'} = Pr(s'|s, a): random variable in [0, 1]
• θ_{sa} = Pr(·|s, a): multinomial distribution,
  Pr(n_{s_1} = n_1, ..., n_{s_k} = n_k) = (n! / (n_1! ··· n_k!)) p_1^{n_1} ··· p_k^{n_k}
POMDP Formulation
Bayesian model-based RL → Partially Observable MDP (POMDP)
• State space: S_P = S × Θ
• S: observable MDP state space
• Θ: all unknown model parameters (unobservable)
POMDP Formulation
• Assume an RL agent in an MDP environment ⟨S, A, T, R⟩.
• In model-based RL, each unknown transition probability is parameterized by a parameter θ^a_{s,s'} ∈ [0, 1].
• Its POMDP formulation ⟨S_P, A, T_P, R_P, O, Z⟩ then has:
  – a new state space S_new = S × {θ^a_{s,s'}}
  – an observation space O = S
  – the transition function T_P(s, θ, a, s', θ') = Pr(s', θ' | s, θ, a) with
    Pr(s' | s, θ^a_{s,s'}, a) = θ^a_{s,s'} and Pr(θ'|θ) = δ_θ(θ')
  – the observation function Z(s', a, o) = Pr(o | s', a) with Pr(o | s', a) = δ_{s'}(o)
  – the reward function R_P(s, θ, a, s', θ') = R(s, a, s')
POMDP Formulation (2)
• The belief over all unknown parameters θ^a_{s,s'} is b(θ) = Pr(θ). If the belief prior is a product of Dirichlets, the posterior is available in closed form, and the belief is written as

  b(θ) = ∏_{s,a} D(θ_{sa}; n_{sa})    (1)

  where each unknown distribution θ_{sa} for a pair (s, a) is represented by one Dirichlet D(θ_{sa}; n_{sa}) = k ∏_{s'} θ_{s,a,s'}^{n^{s,s'}_a − 1}, and n_{sa} is the vector of parameters {n^{s,s'}_a}.
• The closed-form belief update after observing a transition (s, a, s') is

  b^{s,s'}_a(θ) = k θ^{s,s'}_a ∏_{s,a} D(θ_{sa}; n_{sa}) = ∏_{s,a} D(θ_{sa}; n_{sa} + δ_{s,a,s'})    (2)

  i.e., the pseudo-count of the observed transition is incremented by one.
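Because the update in Eq. (2) only increments one Dirichlet count, it is a few lines of code. A sketch with standard Dirichlet pseudo-count bookkeeping (the class is hypothetical):

import numpy as np

class DirichletBelief:
    def __init__(self, n_states, n_actions, prior=1.0):
        # n[s,a,s'] are the Dirichlet parameters n_a^{s,s'}
        self.n = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        # Eq. (2): posterior after observing the transition (s, a, s')
        self.n[s, a, s_next] += 1.0

    def mean_model(self):
        # expected transition probabilities E[theta | b]
        return self.n / self.n.sum(axis=2, keepdims=True)

    def sample_model(self, rng=None):
        # draw a transition model theta ~ b(theta), one Dirichlet per (s,a)
        rng = rng or np.random.default_rng()
        return np.apply_along_axis(rng.dirichlet, 2, self.n)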
Bayesian Optimality
• The Bellman equations:

  V^π(b, s) = R(b, s, π(b, s)) + γ ∫_{b',s'} p(b', s' | π(b, s), b, s) V^π(b', s')

  – typically, the MDP structure is fixed; the belief is over the parameters
  – the belief is updated after each observation (s, a, r, s')
  – only tractable for very simple problems
• Bayes-optimal policy: a = argmax_a V(b, a)
  – no other policy leads to more rewards in expectation w.r.t. the prior distribution over MDPs
  – solves the exploration-exploitation tradeoff implicitly: minimizes uncertainty about the parameters, while exploiting where it is certain
  – is not PAC-MDP efficient!
Example: BRL for Bandits
Many slides from the Autonomous Systems course, Marc Toussaint, WS 13/14.
Bandits recap
• Let a_t ∈ {1, ..., n} be the choice of machine at time t, and let y_t ∈ R be the outcome with mean ⟨y_{a_t}⟩. A policy or strategy maps the full history to a new choice:
  π : [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})] ↦ a_t
• Problem: Find a policy π that maximizes
  ⟨∑_{t=1}^T y_t⟩   or   ⟨y_T⟩
• "Two effects" of choosing a machine:
  – You collect more data about the machine → knowledge
  – You collect reward
The Belief State
• "Knowledge" can be represented in two ways:
  – as the full history
    h_t = [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})]
  – as the belief
    b_t(θ) = P(θ | h_t)
    where θ are the unknown parameters θ = (θ_1, ..., θ_n) of all machines
• In the bandit case:
  – The belief factorizes: b_t(θ) = P(θ | h_t) = ∏_i b_t(θ_i | h_t)
    e.g. for Gaussian bandits with constant noise, θ_i = μ_i:
    b_t(μ_i | h_t) = N(μ_i | y_i, s_i)
    e.g. for binary bandits, θ_i = p_i, with prior Beta(p_i | α, β) (see the code sketch below):
    b_t(p_i | h_t) = Beta(p_i | α + a_{i,t}, β + b_{i,t})
    a_{i,t} = ∑_{s=1}^{t-1} [a_s = i][y_s = 0],   b_{i,t} = ∑_{s=1}^{t-1} [a_s = i][y_s = 1]
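For binary bandits, the belief is just two pseudo-counts per machine. A Python sketch (hypothetical class; note it uses the common convention that α counts y = 1 outcomes, so the roles of the two counts are swapped relative to the slide's notation):

import numpy as np

class BetaBanditBelief:
    def __init__(self, n_arms, alpha=1.0, beta=1.0, seed=0):
        self.alpha = np.full(n_arms, alpha)   # pseudo-counts
        self.beta = np.full(n_arms, beta)
        self.rng = np.random.default_rng(seed)

    def update(self, arm, outcome):
        # Beta posterior update for a 0/1 outcome of the pulled arm
        if outcome == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

    def choose_thompson(self):
        # Thompson sampling: sample p_i ~ Beta(alpha_i, beta_i), pick the best
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))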
The Belief MDP

P(b' | y, a, b) = 1 if b' = b'_{[b,a,y]}, 0 otherwise;    P(y | a, b) = ∫_{θ_a} b(θ_a) P(y | θ_a)

• The Belief MDP describes a different process: the interaction between the information available to the agent (b_t or h_t) and its actions, where the agent uses its current belief to anticipate observations, P(y | a, b).
• The belief (or history h_t) is all the information the agent has available; P(y | a, b) is the "best" possible anticipation of observations. If the agent acts optimally in the Belief MDP, it acts optimally in the original problem.
Example: POMDP Reduction to Belief MDP
An example with binary n-armed bandits (see the code sketch below)
• State space: S_b = {beliefs b reachable within horizon T}
• Transition function:
  T(b, a, b') = ∑_r p(b' | b, a, r) p(r | b, a)
              = δ_{b' = b(a, r=1)} p(r | b, a) + δ_{b' = b(a, r=0)} (1 − p(r | b, a))
  where p(r | b, a) = ∫ p(r | μ_a, a) b(μ_a) dμ_a = α^a_b / (α^a_b + β^a_b)
• Similarly, what is the reward function R(b, a)?
• This yields a well-defined belief MDP formulation {S_b, A, T, R} (finite belief state space, transition function, Bellman equations)
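A minimal sketch of the two quantities used above, with beliefs stored as integer pseudo-count pairs (the helper names are hypothetical):

from fractions import Fraction

def predictive_prob(alpha, beta):
    # p(r = 1 | b, a) for a Beta(alpha, beta) belief over arm a
    return Fraction(alpha, alpha + beta)

def belief_transition(belief, arm, reward):
    # deterministic update b -> b(a, r): bump the matching pseudo-count
    alpha, beta = belief[arm]
    belief = dict(belief)   # treat beliefs as values: copy, don't mutate
    belief[arm] = (alpha + 1, beta) if reward == 1 else (alpha, beta + 1)
    return belief

# usage: two arms with uniform Beta(1,1) priors
b = {0: (1, 1), 1: (1, 1)}
p1 = predictive_prob(*b[0])          # = 1/2
b_next = belief_transition(b, 0, 1)  # arm 0 succeeded -> Beta(2,1)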
Conclusions
• RL agents need to solve the exploration-exploitation tradeoff.
• Sample complexity measures the required number of explorative actions of an algorithm.
• Ideas for driving exploration: random actions, optimism in the face of uncertainty, maximizing learning progress and information gain.
References
• Kakade (2003): On the sample complexity of reinforcement learning. PhD thesis.
• Poupart, Vlassis, Hoey, Regan: An analytic solution to discrete Bayesian reinforcement learning. ICML 2006.
• Li, Littman, Walsh: Knows what it knows: a framework for self-aware learning. ICML 2008.