Reinforcement Learning
Model-Based Reinforcement Learning
Keywords: model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning
Vien Ngo, Marc Toussaint
University of Stuttgart
Outline
• Model-Based RL
• Exploration/Exploitation
• PAC-MDP
• E3, RMAX
• Bayes optimal: Bayesian model-based RL
RL Approaches
Limitations of the model-free view
• given learnt values, behavior is a fixed stimulus-response (or state-action) mapping
• if the "goal" changes: values/policies need to be re-learned for every state in the world! all previous values are obsolete
• no general "knowledge", only values
• no anticipation of outcomes (s'), only of value
• no "planning"
Goal-directed Decision Making
Wolfgang Köhler (1917): Intelligenzprüfungen an Menschenaffen (The Mentality of Apes)
[movie]
Goal-directed vs. habitual: Devaluation
Niv, Joel & Dayan: A normative perspective on motivation. TICS, 10:375-381, 2006.
By definition, goal-directed behavior is performed to obtain a desired goal. Although all instrumental behavior is instrumental in achieving its contingent goals, it is not necessarily purposively goal-directed. Dickinson and Balleine [1,11] proposed that behavior is goal-directed if: (i) it is sensitive to the contingency between action and outcome, and (ii) the outcome is desired. Based on the second condition, motivational manipulations have been used to distinguish between two systems of action control: if an instrumental outcome is no longer a valued goal (for instance, food for a sated animal) and the behavior persists, it must not be goal-directed. Indeed, after moderate amounts of training, outcome revaluation brings about an appropriate change in instrumental actions (e.g. lever-pressing) [43,44], but this is no longer the case for extensively trained responses ([30,31], but see [45]). That extensive training can render an instrumental action independent of the value of its consequent outcome has been regarded as the experimental parallel of the folk psychology maxim that well-performed actions become habitual [9] (see Figure I).
Niv, Joel & Dayan: A normative perspective on motivation. TICS, 10:375-381, 2006.
Model-based RL
• Model learning: Given data D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^H, estimate P(s'|a,s) and P(r|a,s)
  – discrete state-action: P(s'|a,s) = #(s',a,s) / #(a,s)
  – continuous state-action: P(s'|a,s) = N(s' | φ(s,a)^T β, Σ)
    estimate parameters β (and perhaps Σ) as for regression (including non-linear features, regularization, cross-validation!)
• Planning:
  – discrete state-action: Value Iteration with the estimated model (see the sketch below)
  – continuous state-action: Least Squares Value Iteration, Differential Dynamic Programming, Planning-by-Inference
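As an illustration of the discrete case, here is a minimal Python sketch of count-based model estimation followed by Value Iteration on the estimated model (function and variable names are hypothetical, not from the original slides):

import numpy as np

def estimate_model(data, n_states, n_actions):
    """Count-based estimates of P(s'|s,a) and E[r|s,a] from (s,a,r,s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)
    # unvisited (s,a) pairs fall back to a uniform next-state distribution
    P = np.divide(counts, n_sa, out=np.full_like(counts, 1.0 / n_states), where=n_sa > 0)
    R = np.divide(rew_sum, n_sa[:, :, 0], out=np.zeros_like(rew_sum), where=n_sa[:, :, 0] > 0)
    return P, R

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    """Plan in the estimated MDP; returns optimal values and the greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V          # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new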
Exploration: Motivation
Start: many potential paths
Continue using this path? Try out alternatives?
Reinforcement learning
• In reinforcement learning (RL), the agent starts to act without a model of the environment.
• The agent has to learn from its experience what to do in order to fulfill tasks and achieve high rewards.
• RL algorithms we have seen thus far: Q-learning, TD-learning.
• Note the difference to the problem of adapting the behavior based on a given model (also called planning / solving an MDP / calculating optimal state and action values). This is a computational subproblem in model-based RL.
• Planning algorithms we have seen thus far: value iteration, policy iteration.
Exploration / Exploitation
• In contrast to supervised learning, in RL the data used for learning depend on the agent.
• Two different types of behavior:
  – exploration: behave with the goal of learning as much as possible
  – exploitation: behave with the goal of getting as much reward as possible
• Challenge in exploration: which actions will lead to the most important learning progress with respect to the goal?
  → exploration as fundamental intelligent behavior
Recall Markov Decision Processes
[graphical model with nodes a_0, s_0, r_0, a_1, s_1, r_1, a_2, s_2, r_2, ...]

P(s_{0:T}, a_{0:T}, r_{0:T}; π) = P(s_0) P(a_0|s_0; π) P(r_0|a_0, s_0) ∏_{t=1}^T P(s_t|a_{t-1}, s_{t-1}) P(a_t|s_t; π) P(r_t|a_t, s_t)

– world's initial state distribution P(s_0)
– world's transition probabilities P(s_{t+1} | a_t, s_t)
– world's reward probabilities P(r_t | a_t, s_t)
– agent's policy π(a_t | s_t) (or deterministic a_t = π(s_t))
– discount parameter γ for future rewards
– two different sources of uncertainty: the world itself (not controlled by the agent) vs. the policy (controlled by the agent)
Exploration-exploitation tradeoff
• Goal of the reinforcement learning agent: maximize future rewards E[∑_{t=0}^∞ γ^t r_t | s_0; π]
• However, the agent does not know the transition parameters P(s_{t+1} | a_t, s_t) and reward parameters P(r_t | a_t, s_t) of the MDP.
• Rather, the agent needs to learn from its experience s_0, a_0, r_0, s_1, a_1, r_1, ... which actions will lead to high rewards.
• Exploration-exploitation tradeoff: Which policy π(a_t | s_t) for action selection shall the agent follow so that it does not miss the high-reward states, but does not spend too much time in low-reward states, either?
  – Exploitation: Prefer actions a_t which have led to reward before?
  – Exploration: Or rather take actions to learn more about the unknown MDP parameters and potentially find states with higher reward?
Sample Complexity
• Let M be an MDP with N states, K actions, discount factor γ ∈ [0, 1), and a maximal reward Rmax > 0.
• Let A be an algorithm (that is, a reinforcement learning agent) that acts in the environment, resulting in s_0, a_0, r_0, s_1, a_1, r_1, ...
• Let V^A_{t,M} = E[∑_{s=0}^∞ γ^s r_{t+s} | s_0, a_0, r_0, ..., s_{t-1}, a_{t-1}, r_{t-1}, s_t].
• V* is the value function of the optimal policy.
• Definition: Let ε > 0 be a prescribed accuracy and δ > 0 an allowed probability of failure. The expression η(ε, δ, N, K, γ, Rmax) is a sample complexity bound for algorithm A if, independently of the choice of s_0, with probability at least 1 − δ, the number of timesteps such that V^A_{t,M} < V*(s_t) − ε is at most η(ε, δ, N, K, γ, Rmax). (Kakade, 2003)
Efficient exploration
• An algorithm with sample complexity polynomial in 1/ε, log(1/δ), N, K, 1/(1 − γ), and Rmax is called PAC-MDP (probably approximately correct in MDPs).
Exploration strategies
• The exploration strategy is reflected in the policy π(s).
• In the following, assume we have estimates Q(s, a).
• greedy (only exploit): π(s) = argmax_a Q(s, a)
  – problem: the learned model is not the same as the true environment
  – without exploration, the agent is likely to miss high rewards
• random: choose action a with probability 1/|A| (uniform over actions)
  – problem: ignores value estimates and thus rewards
Exploration strategies (continued)
• ε-greedy:
  π(s) = argmax_a Q(s, a)   with probability 1 − ε
         random action      with probability ε
  – most popular method
  – converges to the optimal value function with probability 1 (all paths will be visited sooner or later), if the exploration rate diminishes according to an appropriate schedule
  – problem: sample complexity exponential in the number of states
• Boltzmann: choose action a with probability exp(Q(s,a)/T) / ∑_{a'} exp(Q(s,a')/T) (see the code sketch below)
  – temperature T controls the amount of exploration
  – problem again: sample complexity exponential in the number of states
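Both strategies can be sketched in a few lines of Python, assuming a tabular value estimate Q of shape (number of states, number of actions); this is an illustrative sketch, not code from the slides:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    # exploit argmax_a Q(s,a) with probability 1 - epsilon, explore otherwise
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, temperature=1.0):
    # sample an action with probability proportional to exp(Q(s,a)/T)
    logits = Q[s] / temperature
    p = np.exp(logits - logits.max())   # shift for numerical stability
    p /= p.sum()
    return int(rng.choice(Q.shape[1], p=p))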
Exploration strategies (continued)
• Other heuristics for exploration:
  – minimize variance of action value estimates
  – optimistic initial values ("optimism in the face of uncertainty")
  – state bonuses: frequency, recency, error, etc.
  → problem again: sample complexity exponential in the number of states
• Bayesian RL: optimal exploration strategy
  – distribution over MDP models (i.e., the parameters of the MDP)
  – posterior distribution updated after each new observation
  – exploration strategy minimizes uncertainty of parameters
  – Bayes-optimal solution to the exploration-exploitation tradeoff (i.e., no other policy is better in expectation w.r.t. the prior distribution over MDPs)
  – only tractable for very simple problems
• E3 and Rmax: principled approaches to the exploration-exploitation tradeoff with polynomial sample complexity
Explicit Explore-or-Exploit (E3) algorithm
Kearns and Singh (2002)
• Model-based approach with polynomial sample complexity (PAC-MDP)
  – uses optimism in the face of uncertainty
  – assumes knowledge of the maximum reward
• Maintains counts for states and actions to quantify confidence in model estimates
  – A state s is known if all actions in s have been executed sufficiently often.
• From observed data, E3 constructs two MDPs:
  – MDP_known: includes known states with (approximately exact) estimates of P(s_{t+1} | a_t, s_t) and P(r_t | a_t, s_t)
    → model which captures what you know
  – MDP_unknown: MDP_known + a special state s' where the agent receives maximum reward
    → model which drives exploration
E3 sketch
Input: state s
Output: action a

if s is known then
    Plan in MDP_known                         ▷ sufficiently accurate model estimates
    if resulting plan has value above some threshold then
        return first action of plan           ▷ exploitation
    else
        Plan in MDP_unknown
        return first action of plan           ▷ planned exploration
    end if
else
    return action with the least observations in s    ▷ direct exploration
end if
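The same decision rule as a schematic Python function; plan_known, plan_unknown, counts, and value_threshold are hypothetical placeholders for planners on MDP_known/MDP_unknown, the visit counters, and the exploitation threshold:

def e3_action(s, known_states, plan_known, plan_unknown, counts, value_threshold):
    """One E3 decision step: exploit, planned exploration, or direct exploration."""
    if s in known_states:
        value, action = plan_known(s)        # plan in MDP_known
        if value >= value_threshold:
            return action                    # exploitation
        _, action = plan_unknown(s)          # plan in MDP_unknown
        return action                        # planned exploration
    # direct exploration: take the action with the fewest observations in s
    return min(counts[s], key=counts[s].get)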
E3 example
[figures: M = true known-state MDP, M̂ = estimated known-state MDP]
S. Singh (Tutorial 2005)
E3 Implementation Setting
• T is the time horizon.
• G^T_max is the maximum T-step return (in the discounted case, G^T_max ≤ T·Rmax).
• A state is known if it was visited O((N T G^T_max / ε)^4 Var_max log(1/δ)) times (Var_max is the maximum variance of the random payoffs over all states).
• For the exploration/exploitation choice at known states, the optimal value function V* is assumed to be given: if the value obtained from MDP_known exceeds V* − ε, then exploit.
RMAX
• R-MAX maintains a single model (it does not separate MDP_known and MDP_unknown) and therefore implicitly explores or exploits.
• The R-MAX and E3 algorithms achieve roughly the same level of performance (Strehl's thesis).
• RMAX builds an approximate MDP based on the reward function
  R̃(s, a) = R(s, a)   if (s, a) is known
             Rmax      otherwise
RMAX sketch
Initialize all counters n(s, a) = 0, n(s, a, s') = 0.
Initialize T(s'|s, a) = I[s = s'], R(s, a) = Rmax.
while true do
    Compute policy π_t using the MDP model (T, R).
    Choose a = π_t(s), observe s', r.
    n(s, a) ← n(s, a) + 1
    r(s, a) ← r(s, a) + r
    n(s, a, s') ← n(s, a, s') + 1
    if n(s, a) = m then
        Update T(·|s, a) = n(s, a, ·)/n(s, a) and R(s, a) = r(s, a)/n(s, a).
    end if
end while
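A compact tabular version of this loop in Python, a sketch under the stated optimistic initialization (the class and its API are hypothetical):

import numpy as np

class RMax:
    def __init__(self, n_states, n_actions, r_max, m, gamma=0.95):
        self.m, self.gamma = m, gamma
        self.n_sa = np.zeros((n_states, n_actions))             # n(s,a)
        self.n_sas = np.zeros((n_states, n_actions, n_states))  # n(s,a,s')
        self.r_sum = np.zeros((n_states, n_actions))            # r(s,a)
        # optimistic initialization: T(s'|s,a) = I[s=s'], R(s,a) = Rmax
        self.T = np.stack([np.eye(n_states)] * n_actions, axis=1)
        self.R = np.full((n_states, n_actions), r_max)

    def update(self, s, a, r, s_next):
        self.n_sa[s, a] += 1
        self.r_sum[s, a] += r
        self.n_sas[s, a, s_next] += 1
        if self.n_sa[s, a] == self.m:        # (s,a) just became "known"
            self.T[s, a] = self.n_sas[s, a] / self.m
            self.R[s, a] = self.r_sum[s, a] / self.m

    def policy(self, n_iters=200):
        # value iteration on the current (optimistic) model
        V = np.zeros(self.R.shape[0])
        for _ in range(n_iters):
            Q = self.R + self.gamma * self.T @ V
            V = Q.max(axis=1)
        return Q.argmax(axis=1)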
RMAX Analysis
• The general PAC-MDP theorem does not easily adapt to the analysis of E3 because of E3's use of two internal models (the original analysis depends on the horizon and the mixing time).
• (Upper bound) There exists m = O((N T² / ε²) ln²(N A / δ)) such that, with probability at least 1 − δ, V(s_t) ≥ V*(s_t) − ε holds for all but
  O((N² A T³ / ε³) ln²(N A / δ))
  timesteps, where N is the number of states and A the number of actions.
• For the discounted case: T = log(1/ε) / (1 − γ).
Limitations of E3 and RMAX
• E3/RMAX is called "efficient" because its sample complexity scales only polynomially in the number of states.
• In natural environments, however, this number of states is enormous: it is exponential in the number of objects in the environment.
• Hence E3/RMAX scales exponentially in the number of objects.
• Generalization over states and actions is crucial for exploration.
KWIK
KWIK (Knows What It Knows): a supervised-learning model
• Input set: X
• Output set: Y
• Observation set: Z
• Hypothesis class: H of functions X → Y
• Target function: h* ∈ H
• Special symbol: ? ("I don't know")
KWIK Example: Coin Learning
• Predict Pr(head) ∈ [0, 1] for a coin
  – from many observations: head or tail
• Algorithm (sketched in code below)
  – Predict ? the first O(1/ε² log(1/δ)) times
  – Use the empirical estimate afterwards
  – The bound O(1/ε² log(1/δ)) follows from Hoeffding's bound
Li et al., ICML 2008.
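A minimal Python sketch of this coin learner, with the constant in the Hoeffding bound made explicit (the class name is hypothetical):

import math

class KwikCoin:
    def __init__(self, epsilon=0.05, delta=0.05):
        # Hoeffding: m >= ln(2/delta) / (2 epsilon^2) samples give an
        # epsilon-accurate empirical mean with probability at least 1 - delta
        self.m = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
        self.heads = 0
        self.n = 0

    def predict(self):
        # "?" (here: None) until enough observations have been collected
        return self.heads / self.n if self.n >= self.m else None

    def observe(self, is_head):
        self.n += 1
        self.heads += int(is_head)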
KWIK-RMAX
• T(s'|s, a) and R(s, a) are KWIK-learned.
Bayesian Model-based RL
Encode the unknown probabilities T(s, a, s') with random variables θ
• θ_{sas'} = Pr(s'|s, a): random variable in [0, 1]
• θ_{sa} = Pr(·|s, a): multinomial distribution,
  Pr(n_{s_1} = n_1, ..., n_{s_k} = n_k) = (n! / (n_1! ··· n_k!)) p_1^{n_1} ··· p_k^{n_k}
POMDP Formulation
Bayesian model-based RL → Partially Observable MDP (POMDP)
• State space: S_P = S × Θ
• S: observable MDP state space
• Θ: all unknown model parameters (unobservable)
POMDP Formulation
• Assume an RL agent in an MDP environment ⟨S, A, T, R⟩.
• In model-based RL, each unknown transition probability is parameterized by a parameter θ^a_{s,s'} ∈ [0, 1].
• Its POMDP formulation ⟨S_P, A, T_P, R_P, O, Z⟩ then has:
  – a new state space S_new = S × {θ^a_{s,s'}}
  – an observation space O = S
  – the transition function T_P(s, θ, a, s', θ') = Pr(s', θ' | s, θ, a) with
    Pr(s' | s, θ^a_{s,s'}, a) = θ^a_{s,s'} and Pr(θ'|θ) = δ_θ(θ')
  – the observation function Z(s', a, o) = Pr(o | s', a) with Pr(o | s', a) = δ_{s'}(o)
  – the reward function R_P(s, θ, a, s', θ') = R(s, a, s')
POMDP Formulation (2)
• The belief over all unknown parameters θ^a_{s,s'} is b(θ) = Pr(θ). If the belief prior is a product of Dirichlets, the posterior is available in closed form, and the belief is written as

  b(θ) = ∏_{s,a} D(θ_{sa}; n_{sa})    (1)

  where each unknown distribution θ_{sa} for a pair (s, a) is represented by one Dirichlet D(θ_{sa}; n_{sa}) = k ∏_{s'} θ_{s,a,s'}^{n^{s,s'}_a − 1}, and n_{sa} is the vector of parameters {n^{s,s'}_a}.
• The closed-form belief update after observing a transition (s, a, s') is

  b^{s,s'}_a(θ) = k θ^{s,s'}_a ∏_{s,a} D(θ_{sa}; n_{sa}) = ∏_{s,a} D(θ_{sa}; n_{sa} + δ_{s,a,s'})    (2)

  i.e., the pseudo-count of the observed transition is incremented by one.
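Because the update in Eq. (2) only increments one Dirichlet count, it is a few lines of code. A sketch with standard Dirichlet pseudo-count bookkeeping (the class is hypothetical):

import numpy as np

class DirichletBelief:
    def __init__(self, n_states, n_actions, prior=1.0):
        # n[s,a,s'] are the Dirichlet parameters n_a^{s,s'}
        self.n = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        # Eq. (2): posterior after observing the transition (s, a, s')
        self.n[s, a, s_next] += 1.0

    def mean_model(self):
        # expected transition probabilities E[theta | b]
        return self.n / self.n.sum(axis=2, keepdims=True)

    def sample_model(self, rng=None):
        # draw a transition model theta ~ b(theta), one Dirichlet per (s,a)
        rng = rng or np.random.default_rng()
        return np.apply_along_axis(rng.dirichlet, 2, self.n)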
Bayesian Optimality
• The Bellman equations:

  V^π(b, s) = R(b, s, π(b, s)) + γ ∫_{b',s'} p(b', s' | π(b, s), b, s) V^π(b', s')

  – typically, the MDP structure is fixed; the belief is over the parameters
  – the belief is updated after each observation (s, a, r, s')
  – only tractable for very simple problems
• Bayes-optimal policy: a = argmax_a V(b, a)
  – no other policy leads to more rewards in expectation w.r.t. the prior distribution over MDPs
  – solves the exploration-exploitation tradeoff implicitly: minimizes uncertainty about the parameters, while exploiting where it is certain
  – is not PAC-MDP efficient!
Example: BRL for Bandits
Many slides from the Autonomous Systems course, Marc Toussaint, WS 13/14.
Bandits recap
• Let a_t ∈ {1, ..., n} be the choice of machine at time t, and let y_t ∈ R be the outcome with mean ⟨y_{a_t}⟩. A policy or strategy maps the full history to a new choice:
  π : [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})] ↦ a_t
• Problem: Find a policy π that maximizes
  ⟨∑_{t=1}^T y_t⟩   or   ⟨y_T⟩
• "Two effects" of choosing a machine:
  – You collect more data about the machine → knowledge
  – You collect reward
The Belief State
• "Knowledge" can be represented in two ways:
  – as the full history
    h_t = [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})]
  – as the belief
    b_t(θ) = P(θ | h_t)
    where θ are the unknown parameters θ = (θ_1, ..., θ_n) of all machines
• In the bandit case:
  – The belief factorizes: b_t(θ) = P(θ | h_t) = ∏_i b_t(θ_i | h_t)
    e.g. for Gaussian bandits with constant noise, θ_i = μ_i:
    b_t(μ_i | h_t) = N(μ_i | y_i, s_i)
    e.g. for binary bandits, θ_i = p_i, with prior Beta(p_i | α, β) (see the code sketch below):
    b_t(p_i | h_t) = Beta(p_i | α + a_{i,t}, β + b_{i,t})
    a_{i,t} = ∑_{s=1}^{t-1} [a_s = i][y_s = 0],   b_{i,t} = ∑_{s=1}^{t-1} [a_s = i][y_s = 1]
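For binary bandits, the belief is just two pseudo-counts per machine. A Python sketch (hypothetical class; note it uses the common convention that α counts y = 1 outcomes, so the roles of the two counts are swapped relative to the slide's notation):

import numpy as np

class BetaBanditBelief:
    def __init__(self, n_arms, alpha=1.0, beta=1.0, seed=0):
        self.alpha = np.full(n_arms, alpha)   # pseudo-counts
        self.beta = np.full(n_arms, beta)
        self.rng = np.random.default_rng(seed)

    def update(self, arm, outcome):
        # Beta posterior update for a 0/1 outcome of the pulled arm
        if outcome == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

    def choose_thompson(self):
        # Thompson sampling: sample p_i ~ Beta(alpha_i, beta_i), pick the best
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))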
The Belief MDP

P(b' | y, a, b) = 1 if b' = b'_{[b,a,y]}, 0 otherwise;    P(y | a, b) = ∫_{θ_a} b(θ_a) P(y | θ_a)

• The Belief MDP describes a different process: the interaction between the information available to the agent (b_t or h_t) and its actions, where the agent uses its current belief to anticipate observations, P(y | a, b).
• The belief (or history h_t) is all the information the agent has available; P(y | a, b) is the "best" possible anticipation of observations. If the agent acts optimally in the Belief MDP, it acts optimally in the original problem.
Example: POMDP Reduction to Belief MDP
An example with binary n-armed bandits (see the code sketch below)
• State space: S_b = {beliefs b reachable within horizon T}
• Transition function:
  T(b, a, b') = ∑_r p(b' | b, a, r) p(r | b, a)
              = δ_{b' = b(a, r=1)} p(r | b, a) + δ_{b' = b(a, r=0)} (1 − p(r | b, a))
  where p(r | b, a) = ∫ p(r | μ_a, a) b(μ_a) dμ_a = α^a_b / (α^a_b + β^a_b)
• Similarly, what is the reward function R(b, a)?
• This yields a well-defined belief MDP formulation {S_b, A, T, R} (finite belief state space, transition function, Bellman equations)
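A minimal sketch of the two quantities used above, with beliefs stored as integer pseudo-count pairs (the helper names are hypothetical):

from fractions import Fraction

def predictive_prob(alpha, beta):
    # p(r = 1 | b, a) for a Beta(alpha, beta) belief over arm a
    return Fraction(alpha, alpha + beta)

def belief_transition(belief, arm, reward):
    # deterministic update b -> b(a, r): bump the matching pseudo-count
    alpha, beta = belief[arm]
    belief = dict(belief)   # treat beliefs as values: copy, don't mutate
    belief[arm] = (alpha + 1, beta) if reward == 1 else (alpha, beta + 1)
    return belief

# usage: two arms with uniform Beta(1,1) priors
b = {0: (1, 1), 1: (1, 1)}
p1 = predictive_prob(*b[0])          # = 1/2
b_next = belief_transition(b, 0, 1)  # arm 0 succeeded -> Beta(2,1)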
Conclusions
• RL agents need to solve the exploration-exploitation tradeoff.
• Sample complexity measures the required number of explorative actions of an algorithm.
• Ideas for driving exploration: random actions, optimism in the face of uncertainty, maximizing learning progress and information gain.
References
• Kakade (2003): On the sample complexity of reinforcement learning. PhD thesis.
• Poupart, Vlassis, Hoey, Regan: An analytic solution to discrete Bayesian reinforcement learning. ICML 2006.
• Li, Littman, Walsh: Knows what it knows: a framework for self-aware learning. ICML 2008.