1. dia - u-szeged.hu

1. dia - u-szeged.hu

Megerstses tanuls pr. 27. Copyrights: Szepesvri Csaba: Megerstses tanuls (2004) Szita Istvn, Lrincz Andrs: Megerstses tanuls (2005) Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998) Megerstses tanuls (reinforcement learning) http://www.youtube.com/watch?v=mRpX9DFCdwI http://www.youtube.com/watch?v=VCdxqn0fcnE Megerstses tanuls

(reinforcement learning) Megerstses tanuls (reinforcement learning) Robot navigcis feladat Pavlov: Nomad 200 robot Nomad 200 simulator

Sridhar Mahadevan UMass Megerstses tanuls Kontroll/vezrlsi problmk Cl: tbb lpses akcisorozatok kialaktsa Interakcibl tanul, a megersts (bntets/jutalom) ltalban nem azonnali Clorientlt! A jutalom egy fggvnyt maximalizljuk. +3 +50 -1

-1 r1 r4 r5 r9 s1

s2 s3 s4 s5 s9 a1

a2 a3 a4 a5 a9 Felgyelt vs Megerstses

gpi tanuls Mindkett gpi tanulsi mdszer Felgyelt Megerstses Azonnali visszajelzs Ksleltetett indirekt visszajelzs Passzv tanuls (elre adott tant adatbzis)

Aktv tanuls (akcikat a rendszer vlasztja amire visszajelzst kapunk) Megerstses tanuls id: llapot: akci: jutalom: eljrsmd (policy, stratgia): determinisztikus: szochasztikus: (s,a) annak a valsznsge, hogy s-ben a-t lp

(vgtelen horizont) interakci: krnyezet modellje: tmeneti valsznsgek s jutalmak cl: maximlis vrhat jutalom: A Markov-feltevs feltesszk, hogy a rgmlt nem szmt: a krnyezet dinamikja lerhat az

tmenetivalsznsgmtrixszal: Markov Dntsi Folyamatok Markov Decision Processes (MDPs) llapotok, vletlentl fgg tmenetekkel tmenetvalsznsgek aktulis llapottl fggnek r=0 1 a1 a2

2 r=2 A felderts-kiaknzs dilemma (exploration exploitation) A k-kar bandita problma tlagos kifizets (jutalom) Akcik 10 0, 0, 5, 10, 35

5, 10, -15, -15, -10 -5 -20, 0, 50 gens 100 Ahhoz, hogy sok jutalmat kapjunk tudnunk kell milyen akcikkal szerezhetjk meg, azaz meg kell ismerni a krnyezetet (felderts), majd a tuds alapjn sszegyjteni a jutalmat (kiaknzs). Clfggvny

folytonos (vgtelen) feladat gond: rt vgtelen lehet! megolds: diszkontls. rt helyett t rt , <1 garantltan vges diszkontls knyelmes Markov dntsi folyamat megoldsa krnyezet lpked P s R szerint: gens lpked szerint:

optimlis eljrsmd: olyan , amelyre maximlis. Hossztv jutalom gens politikja rgztett: Az Rt kifizets a t pillanat utni sszjutalomalapjn +3 +50 -1 r1

r4 -1 r5 r9 llapot hasznossga (rtke) = Vrhat kifizets Rt valsznsgi vltoz Vehetjk a vrhat rtkt! Politiktl fgg Rt !

V()-t rtkelfggvnynek hvjuk Feladat: talljuk meg azt a eljrsmdot amelyik a vrhat rtket maximalizlja, minden llapotban Az eddigi sztori.. Tbb lpses dntsi feladatok Cl *-ot megtallni A minden llapotban optimlis biztostja a legtbb hossztv jutalmat at at+1

st at+2 st+1 rt+1 st+2 rt+2 st+3 rt+3

17 A Bellman egyenletek A Markov tulajdonsg miatt a vrhat sszjutalom egy rekurzv egyenlettel is kifejezhet: 4 (s) s 3 5

Eljrsmdok sszehasonltsa 1 2, ha rszbenrendezs * optimlis, ha * minden eljrsmdra mindig ltezik ilyen Plda: egy nagyon egyszer MDP -10 2 B

A -10 1 2 -10 1 -10 1 -10

C D -10 2 2 1 +100 cl 4 llapot, 2 akci

10% esllyel rossz irnyba megy Plda (A,1) = 1A,1) = 1) = 1) = 1 (A,1) = 1B,1) = 1) = 1) = 1 (A,1) = 1C,1) = 1) = 1) = 1 (A,1) = 1D,1) = 1) = 1) = 1 =0 (A,1) = 1A,2) = 0) = 0 (A,1) = 1B,2) = 0) = 0 (A,1) = 1C,2) = 0) = 0 (A,1) = 1D,2) = 0)

A 2 C -102 -101 1 -10 1 -10 2

B 1 -102 D 2 1 +100 cl Plda

Plda megolds: 2 stratgia: mindig 2-t lp Plda: egy 3. eljrsmd (A,1) = A,1) = 0,4 3(A,1) = B,1) = 1 3(A,1) = C,1) = 0 3(A,1) = D,1) = 1 (A,1) = A,2) = ) = 0,6 3(A,1) = B,2) = ) = 0

3(A,1) = C,2) = ) = 1 3(A,1) = D,2) = ) = 0 -10 2 B A -10 1 2 1 1

-10 C -10 2 1 D -10 2

2 1 +100 cl Plda: egy 3. eljrsmd Plda: egy 3. eljrsmd megolds: Plda: sszehasonlts 1) = 1

2) = 0 3 A 75.61) = 1 75.61) = 1 77.78 B

87.56 68.05 87.78 C 68.05 87.56 87.78

D 1) = 100 1) = 100 1) = 100 1 3 s 2 3 3 optimlis eljrsmd sok optimlis eljrsmd van! az optimlis rtkelfggvny (V) egyrtelm

Az optimlis rtkelfggvny Bellman-egyenlete Optimlis rtkel fggvny Moh eljrsmd: mindig a Q* szerinti legjobb akcit vlasztja: argmaxa Q*(s,a) Ez optimlis eljrsmd!!! Az optimlis rtkelfggvny Bellman-egyenlete nemlineris! van egyrtelm megoldsa

megoldja a hossztv tervezs problmjt MDP megoldsa dinamikus programozssal Tfh. P s R ismert Kerssk -t Eljrsmd iterci rtkiterci Eljrsmd iterci

Jack's Car Rental Problem: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables with parameter . Suppose is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be 0.9 and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight. rtkiterci

Eljrsmditerci vs. rtkiterci melyik jobb? eljrsmditercinak kevesebb lps elg de azok a lpsek sokig tartanak rtkiterci polinom idben optimlis rtkelfggvnyhez konvergl Eljrsmditerci: konvergl, de nem ismert, hogy polinomilis-e gyakorlatban: problmafgg

ltalnos eljrsmd iterci Eljrsmd kirtkelse modell (P s R) nlkl keressk V-tt R(A,1) = s): nyeresg s-bl indulva, valsznsgi vltoz vrhat rtke: V(A,1) = s) V(A,1) = 1s) becslse Monte Carlo mdszer, MC R(s) modell nlkl szmthat, szimulcival

tapasztalati tlag: vesznk N darab s-bl indul utat (epizd), a nyeresgek: Monte Carlo rtkelbecsls Az idbeli differencik mdszere (Temporal Differences, TD) becslsnk hibja: Elnye: nem kell megvrni az epizd vgt (szemben az MC-vel)

a becslshez egy msik becslst hasznlunk Az idbeli differencik mdszere rtkelbecslsre sszehasonlts: DP, MC, TD 3 mdszer V becslsre: DP: a krnyezet modellje (P s R) ismert A vrhat rtket pontosan szmoljuk MC: kzelt megolds, szimullunk epizdokat

frissts csak az epizd vgn TD: frissts a szimulci egyetlen lpse alapjn a mintavtel zajos, ezrt csak -nyi mrtkben vesszk figyelembe Egy explorcis stratgia (Sarsa) Moh akci 1-valsznsggel Vletlen akci valsznsggel Az explorcis stratgia

javtsa az -moh stratgia nagyon rossz! a felfedez lpsek vletlen bolyongsok plda jobb mdszerre: explorcis bnuszok jutalom, ha ritkn ltogatott llapotba jut az gynk jutalom pl. legutbbi ltogats ideje, TD hiba nagysga, stb. egyszer mdszer a felderts btortsra: optimista kezdrtkek

eleinte minden akcit vgigprbl, mert sok jutalmat reml Regresszi alap RL Ha az llapotok s akcik szma tl nagy kezelhetetlen lesz a problma tl sok epizd kell a j becslshez Eddig csak diszkrt llapot s akciterekrl besztnk (folytonos esetek?) dS V (s | ) :

Q ( s, a | ) : d S d A Egy klnsen sikeres plda: TD-gammon TD() tanuls, 1 rejtett rteg neuronhl, Backprop 1,500,000 jtk (sajt magval) A legjobb jtkosokkal azonos kpessgek (vilgbajnok) Backgammon llapottere: ~1020 , DP

nem megy!!

Recently Viewed Presentations

  • Welcome to Open House September 9, 2015

    Welcome to Open House September 9, 2015

    All work will be in their word study notebook that can be taken home to study. We will also be starting vocabulary with four words per week. Please look for these in the word study notebook as well as ....
  • On Knowing a Language - Linguistics

    On Knowing a Language - Linguistics

    Criticisms to Skinner This theory cannot explain creativity in generating language: language behavior is more complex than the establishment of S-R connections Lack of empirical evidence: the role of imitation and reinforcement is smaller Universal Grammar Theory (Chomsky) Language is...
  • Grammar Rules - PC&#92;|MAC

    Grammar Rules - PC\|MAC

    TOP 10. Never use an apostrophe to create a plural. An apostrophe is used to show ownership or to make a contraction. Example: I'm the best teacher you'll ever have.
  • Prayer &amp; Personal Bible Study - Church Leadership Resources

    Prayer & Personal Bible Study - Church Leadership Resources

    Prayer & Personal Bible Study. Prayer & Personal Bible Study. ... how much more will your Father who is in heaven give good things to those who ask Him! ... In this manner, therefore, pray: Our Father in heaven, Hallowed...
  • Competition and Specialization in the Hospital Industry: An ...

    Competition and Specialization in the Hospital Industry: An ...

    Duopoly. Two stage game (1. st. stage service mix, 2. nd. stage qualities) Logically consistent for a firm consider "what to produce" first, then determine "quality of service" Simultaneous game will yield the same result as the erogenous quality case....
  • NTC - American Public Transportation Association

    NTC - American Public Transportation Association

    Feasible in higher density corridors, routes with strong anchors on both ends and park-and-ride based peak-period commuter service. 16 - 30. 2 - 4. ... Operator has made a policy decision to emphasize coverage over cost-efficiency.
  • RSPV Resolve to Stop the Violence Programme

    RSPV Resolve to Stop the Violence Programme

    Cognitive Triangle. Situation. Thoughts. Feelings. Actions. Control Log. QUOTE "There comes a point where we need to stop just pulling people out of the river.We need to go upstream and find out why they're falling in." ― Desmond Tutu. RSVP....
  • Orthographic Projection Review: -Projections -Orthographic projections Home work:

    Orthographic Projection Review: -Projections -Orthographic projections Home work:

    Ken Youssefi * Mechanical and Aerospace Engineering Dept., SJSU GLASS BOX UNFOLDED. To be turned in on Duplicate the drawing. Draw the views of the object, sides of the unfolded box represented as rectangles and construction lines connecting object views....