Reinforcement Learning (editable PPT document).ppt

Uploaded by: 快乐****生活
Document ID: 2670832
Upload date: 2024-06-04
Format: PPT
Pages: 79
Size: 1.30 MB


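Advanced Artificial Intelligence
Chapter 10: Reinforcement Learning
Zhongzhi Shi (史忠植)
Institute of Computing Technology, Chinese Academy of Sciences
2024/5/25 (Saturday)

Outline
- Introduction
- The reinforcement learning model
- Dynamic programming
- Monte Carlo methods
- Temporal-difference learning
- Q-learning
- Function approximation in reinforcement learning
- Applications

Introduction
Humans usually learn by interacting with their environment. Reinforcement learning is the learning of a mapping from environment states to actions such that the cumulative reward the system's behavior obtains from the environment is maximized. In reinforcement learning we design algorithms that turn situations in the external environment into actions that maximize the amount of reward. We do not tell the agent directly what to do or which action to take; the agent discovers this for itself by seeing which actions yield the most reward. An action affects not only the immediate reward but also the subsequent actions and the eventual reward. Trial-and-error search and delayed reinforcement are the two most important features of reinforcement learning.

Introduction
Reinforcement learning grew out of control theory, statistics, psychology and related disciplines, and can be traced back to Pavlov's conditioning experiments. Only in the late 1980s and early 1990s, however, did it become widely studied and applied in artificial intelligence, machine learning and automatic control, where it came to be regarded as one of the core technologies for designing intelligent systems. Once research on its mathematical foundations achieved breakthroughs, work on reinforcement learning and its applications expanded steadily, and it is now one of the active research topics in machine learning.

Introduction
- The idea of reinforcement originated in psychology. In 1911 Thorndike proposed the Law of Effect: a behavior that makes an animal feel comfortable in a given situation becomes more strongly associated with that situation (it is reinforced), and is more likely to recur when the situation recurs. Conversely, a behavior that makes the animal uncomfortable becomes less strongly associated with the situation and is unlikely to recur. Put differently, which behavior is "remembered" and associated with a stimulus depends on the effect the behavior produces.
- Animal trial-and-error learning involves two notions, selectional and associative, which correspond computationally to search and memory. In 1954 Minsky implemented computational trial-and-error learning in his doctoral thesis. In the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence" and was widely used thereafter. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

Introduction
- Between 1953 and 1957 Bellman proposed an effective method for solving optimal control problems: dynamic programming.
- In 1957 Bellman also introduced the stochastic, discrete version of the optimal control problem, the well-known Markov decision process (MDP). In 1960 Howard proposed the policy iteration method for Markov decision processes. These results became the theoretical foundation of modern reinforcement learning.
- In 1972 Klopf combined trial-and-error learning with temporal differences. From 1978 onward Sutton, Barto, Moore, Klopf and others studied this combination in depth.
- In 1989 Watkins proposed Q-learning [Watkins 1989], which also tied the three threads of reinforcement learning together.
- In 1992 Tesauro successfully applied reinforcement learning to backgammon in a program called TD-Gammon.

The reinforcement learning model
[Figure: agent-environment loop. The agent receives state s_i and reward r_i from the environment and emits action a_i; the environment responds with s_{i+1} and r_{i+1}. Legend: i = input, r = reward, s = state, a = action. Trajectory: s0, a0, s1, a1, s2, a2, s3.]

Describing an environment (problem)
- Accessible vs. inaccessible
- Deterministic vs. non-deterministic
- Episodic vs. non-episodic
- Static vs. dynamic
- Discrete vs. continuous
The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.

The reinforcement learning problem
- Agent-environment interaction
  - States, actions, rewards
- To define a finite MDP:
  - state and action sets S and A
  - one-step "dynamics" defined by transition probabilities (Markov property): [equation missing]
  - reward probabilities: [equation missing]
[Figure: the RL agent sends an action to the environment; the environment returns a state and a reward.]

Comparison with supervised learning
- Reinforcement learning: learn from interaction
  - The learner learns from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
  [Figure: inputs enter the RL system, which outputs "actions"; training info = evaluations ("rewards" / "penalties").]
- Supervised learning: learn from examples provided by a knowledgeable external supervisor.

Elements of reinforcement learning
- Policy: stochastic rule for selecting actions
- Return/reward: the function of future rewards the agent tries to maximize
- Value: what is good because it predicts reward
- Model: what follows what
[Figure: nested elements. Policy: is my method. Reward: is my goal. Value: is what I can get. Model of environment: is unknown.]

The Bellman equation under a policy π
- The basic idea: [equation missing]
- So: [equation missing]
- Or, without the expectation operator: [equation missing]
- γ is the discount rate.

The Bellman optimality equation
[equation missing]

Markov decision process
- A k-armed bandit gives immediate reward. What about delayed reward?
- Characteristics of an MDP:
  - a set of states: S
  - a set of actions: A
  - a reward function R: S × A → ℝ
  - a state transition function T: S × A → Π(S)
  - T(s, a, s'): probability of a transition from s to s' using action a

MDP example
[Figure: transition function; states and rewards.]
- Bellman equation: [equation missing]
- (Greedy policy selection)

MDP graphical representation
[Figure: graph whose edges are labeled with T(s, action, s').]
- Similarity to hidden Markov models (HMMs)

Dynamic programming: the problem
- A discrete-time dynamic system
  - States {1, ..., n} plus a termination state 0
  - Control U(i)
  - Transition probability p_ij(u)
- Accumulative cost structure
- Policies

Dynamic programming: iterative solution
- Finite-horizon problem
- Infinite-horizon problem
- Value iteration

Policy iteration and value iteration in dynamic programming
[Figure: policy evaluation alternating with policy improvement ("greedification"); policy iteration compared with value iteration.]

Dynamic programming methods
[Figure: backup diagram.]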
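Value iteration, which the dynamic programming slides name without showing any equations, can be sketched on the 4x3 grid world that the deck uses further on. This is a minimal illustration, not the deck's own code. It takes the 0.8/0.1/0.1 motion model and the move cost of 0.04 from the slides. The blocked cell at (2, 2), the undiscounted setting GAMMA = 1.0, the convergence threshold and all identifiers are assumptions of this sketch.

```python
# Value iteration on a 4x3 grid world (illustrative sketch under stated assumptions).
GAMMA = 1.0           # assumed: undiscounted, episodes end in a terminal cell
STEP_REWARD = -0.04   # a move cost of 0.04, expressed as a negative reward
WALL = (2, 2)         # assumed position of the blocked cell
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]


def move(state, action):
    """Deterministic effect of an action; bumping into the wall or an edge stays put."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def transitions(state, action):
    """Return [(probability, next_state)]: 0.8 as intended, 0.1 to each side."""
    side_a, side_b = PERPENDICULAR[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, side_a)),
            (0.1, move(state, side_b))]


def expected_value(state, action, values):
    """Expected utility of the successor state for one action."""
    return sum(p * values[s2] for p, s2 in transitions(state, action))


def value_iteration(epsilon=1e-6):
    """Repeat Bellman optimality backups until no value changes by epsilon or more."""
    values = {s: TERMINALS.get(s, 0.0) for s in STATES}
    while True:
        delta = 0.0
        new_values = dict(values)
        for s in STATES:
            if s in TERMINALS:
                continue
            best = max(expected_value(s, a, values) for a in ACTIONS)
            new_values[s] = STEP_REWARD + GAMMA * best
            delta = max(delta, abs(new_values[s] - values[s]))
        values = new_values
        if delta < epsilon:
            return values


def greedy_policy(values):
    """Pick, in every non-terminal state, the action with the best expected value."""
    return {s: max(ACTIONS, key=lambda a: expected_value(s, a, values))
            for s in STATES if s not in TERMINALS}


values = value_iteration()
policy = greedy_policy(values)
```

Under these assumptions the start cell (1, 1) comes out near 0.705 and the cell (4, 1) near 0.388, in line with the utilities quoted on the grid-world slide.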

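The formulas on the finite-MDP and Bellman slides did not survive extraction. As a hedged reconstruction, assuming the deck follows the common textbook notation (the symbols P and R with sub- and superscripts are an assumption, not recovered from the text), the standard forms that match the surviving labels "the basic idea", "so" and "without the expectation operator" are:

```latex
\begin{align}
  % One-step dynamics and expected rewards of a finite MDP
  P^{a}_{ss'} &= \Pr\{ s_{t+1} = s' \mid s_t = s,\ a_t = a \} \\
  R^{a}_{ss'} &= E\{ r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \} \\
  % The basic idea: the return is defined recursively
  R_t &= r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots
       = r_{t+1} + \gamma R_{t+1} \\
  % So: the value of a state under policy \pi
  V^{\pi}(s) &= E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \} \\
  % Or, without the expectation operator
  V^{\pi}(s) &= \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'}
                \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right] \\
  % Bellman optimality equation
  V^{*}(s) &= \max_{a} \sum_{s'} P^{a}_{ss'}
              \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]
\end{align}
```

Here γ is the discount rate named on the slide, and π(s, a) is the probability that the policy selects action a in state s.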
Adaptive dynamic programming (ADP)
- Idea: use the constraints (state transition probabilities) between states to speed learning.
- Solve [equation missing] using DP (= value determination).
- No maximization over actions, because the agent is passive, unlike in value iteration.
- Large state space, e.g. backgammon: 10^50 equations in 10^50 variables.

Value iteration algorithm
[algorithm not extracted]
- An alternative iteration (Singh, 1993), important for model-free learning: [equation missing]
- Stop iterating when V(s) changes by less than ε.
- Policy difference ratio = 2εγ / (1 − γ) (Williams & Baird 1993b)

Policy iteration algorithm
[algorithm not extracted]
- Policies converge faster than values.
- Why faster convergence?

Reinforcement learning
- Deterministic transitions
- Stochastic transitions: M^a_ij is the probability of reaching state j when taking action a in state i.
[Figure: 4x3 grid with columns 1 to 4 and rows 1 to 3, a start cell, and terminal cells +1 and −1.]
- A simple environment that presents the agent with a sequential decision problem. Move cost = 0.04.
- (Temporal) credit assignment problem; sparse reinforcement problem.
- Offline algorithm: action sequences are determined ex ante.
- Online algorithm: action sequences are conditional on observations along the way. This is important in stochastic environments (e.g. jet flying).

Reinforcement learning
- M = 0.8 in the direction you want to go, 0.2 perpendicular (0.1 left, 0.1 right).
- Policy: a mapping from states to actions.
[Figure: an optimal policy for the stochastic environment.]
- Utilities of states (columns 1 to 4, rows 3 down to 1):
    row 3:  0.812   0.868     0.912   +1
    row 2:  0.762   (blocked) 0.660   −1
    row 1:  0.705   0.655     0.611   0.388
- Environment
  - Observable (accessible): the percept identifies the state.
  - Partially observable.
- Markov property: transition probabilities depend on the state only, not on the path to the state.
- Markov decision problem (MDP).
- Partially observable MDP (POMDP): percepts do not carry enough information to identify transition probabilities.

Model-free methods
- Models of the environment: T: S × A → Π(S) and R: S × A → ℝ
- Do we know them? Do we have to know them?
  - Monte Carlo methods
  - Adaptive heuristic critic
  - Q-learning

Monte Carlo policy evaluation
- Goal: learn V^π(s) when P and R are unknown in advance.
- Given: some number of episodes under π that contain s.
- Idea: average the returns observed after visits to s.
- Every-visit MC: average the returns for every time s is visited in an episode.
- First-visit MC: average the returns only for the first time s is visited in an episode.
- Both converge asymptotically.
[Figure: states numbered 1 to 5.]

Monte Carlo methods
- Idea: hold statistics about rewards for each state and take the average. This is V(s).
- Based only on experience.
- Assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected).
- Incremental in an episode-by-episode sense, not in a step-by-step sense.

Monte Carlo methods
- Problem: unvisited <s, a> pairs (the problem of maintaining exploration).
- For every <s, a>, make sure that P(<s, a> selected as a start state and action) > 0 (the assumption of exploring starts).

Monte Carlo methods
[Figure: backup diagram.]

Monte Carlo control
- How to select policies (similar to policy evaluation):
- MC policy iteration: policy evaluation using MC methods, followed by policy improvement.
- Policy improvement step: greedify with respect to the value (or action-value) function.

Temporal-difference learning
- [equation missing] target: the actual return after time t
- [equation missing] target: an estimate of the return
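The Monte Carlo policy evaluation described in the slides above (average the returns that follow the first visit to each state) can be sketched as follows. This is an illustration under stated assumptions, not the deck's code. The environment is a hypothetical five-state random walk with terminal states at both ends and a reward of 1 only on the right exit. The fixed 50/50 policy, GAMMA = 1.0, the episode count and all names are assumptions.

```python
import random
from collections import defaultdict

N_STATES = 5   # assumed layout: non-terminal states 1..5, terminals 0 and 6
GAMMA = 1.0    # assumed: undiscounted episodic task


def generate_episode(rng):
    """One episode under a fixed 50/50 left/right policy: list of (state, reward)."""
    state = (N_STATES + 1) // 2          # start in the middle
    episode = []
    while 0 < state <= N_STATES:
        nxt = state + rng.choice((-1, 1))
        reward = 1.0 if nxt == N_STATES + 1 else 0.0
        episode.append((state, reward))
        state = nxt
    return episode


def first_visit_mc(n_episodes=20000, seed=0):
    """Estimate V(s) as the average return following the first visit to s."""
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(rng)
        # Compute the return that follows every time step, working backwards.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + GAMMA * g
            returns.append((state, g))
        returns.reverse()
        # Keep only the return after the first visit to each state.
        seen = set()
        for state, g in returns:
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in sorted(returns_count)}


estimates = first_visit_mc()
```

For this walk the true value of state s is s/6, so the estimates can be checked against a known answer. Dropping the `seen` filter turns the sketch into every-visit MC.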

Temporal-difference learning (TD)
- Idea: do ADP backups on a per-move basis, not for the whole state space. [equation missing]
- Theorem: the average value of U(i) converges to the correct value.
- Theorem: if α is appropriately decreased as a function of the number of times a state has been visited (α depends on the visit count N_i), then U(i) itself converges to the correct value.
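A per-move TD backup with a step size that shrinks with the visit count can be sketched as below. The update rule itself was lost from the slide, so the standard TD(0) form U(i) ← U(i) + α [r + γ U(j) − U(i)] is assumed here. The five-state random walk, the schedule α = c / (c + N_i) with c = 10, and all identifiers are likewise assumptions made for illustration.

```python
import random

N_STATES = 5        # assumed layout: states 1..5, terminals 0 and 6
GAMMA = 1.0         # assumed: undiscounted episodic task
STEP_SCALE = 10.0   # assumed constant c in alpha = c / (c + N_i)


def td0_random_walk(n_episodes=50000, seed=0):
    """TD(0) utility estimates for a 50/50 random walk, one backup per move."""
    rng = random.Random(seed)
    utility = [0.0] * (N_STATES + 2)   # U(0) and U(6) stay 0 (terminal states)
    visits = [0] * (N_STATES + 2)
    for _ in range(n_episodes):
        i = (N_STATES + 1) // 2        # start in the middle
        while 0 < i <= N_STATES:
            j = i + rng.choice((-1, 1))
            reward = 1.0 if j == N_STATES + 1 else 0.0
            visits[i] += 1
            alpha = STEP_SCALE / (STEP_SCALE + visits[i])
            # Move U(i) toward the one-step target r + gamma * U(j).
            utility[i] += alpha * (reward + GAMMA * utility[j] - utility[i])
            i = j
    return utility[1:N_STATES + 1]


td_estimates = td0_random_walk()
```

The true utilities are 1/6, 2/6, ..., 5/6. A plain 1/N_i schedule also satisfies the usual step-size conditions but tends to converge much more slowly on this task, which is why the sketch uses a scaled schedule.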

Temporal-difference learning (TD)
[Figure: backup diagram.]

TD(λ): a forward view
- TD(λ) is a method for averaging all n-step backups
  - weight by λ^(n−1) (time since visitation)
  - λ-return: $R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
- Backup using the λ-return: $\Delta V_t(s_t) = \alpha \, [R_t^{\lambda} - V_t(s_t)]$

The TD(λ) learning algorithm
- Idea: update from the whole epoch, not just on a state transition.
- Special cases:
  - λ = 1: least mean squares (LMS), Monte Carlo
  - λ = 0: TD
- An intermediate choice of λ (between 0 and 1) is best.
- Interplay with α.

The TD learning algorithm
[slide content not extracted]

Convergence of the TD(λ) learning algorithm
- Theorem: converges w.p. 1 under certain boundary conditions.
- Decrease α_i(t) such that $\sum_t \alpha_i(t) = \infty$ and $\sum_t \alpha_i(t)^2 < \infty$.
- In practice, a fixed α is often used for all i and t.

Temporal-difference learning (TD)
[slide content not extracted]

Q-learning [Watkins, 1989]
- Estimate the Q-function using some approximator (for example linear regression, neural networks or decision trees).
- Derive the estimated policy as an argument of the maximum of the estimated Q-function.
- Allow different parameter vectors at different time points.
- Let us illustrate the algorithm with linear regression as the approximator and, of course, squared error as the appropriate loss function.

Q-learning
- Q(a, i)
- The direct approach (ADP) would require learning a model M^a_ij. Q-learning does not.
- Do this update after each state transition: [equation missing]

Exploration
- Tradeoff between exploitation (control) and exploration (identification).
- Extremes: greedy vs. random acting (n-armed bandit models).
- Q-learning converges to the optimal Q-values if
  - every state is visited infinitely often (due to exploration),
  - the action selection becomes greedy as time approaches infinity, and
  - the learning rate α is decreased fast enough but not too fast (as discussed for TD learning).

Common exploration methods
1. In value iteration in an ADP agent: an optimistic estimate of utility U+(i). [equation missing]
   Exploration function: f(u, n) = R+ if n < N, u otherwise.
2. ε-greedy method: probabilities for nongreedy actions and for the greedy action. [equation missing]
3. Boltzmann exploration. [equation missing]

Q-learning algorithm
- Set [equation missing]
- For [equation missing]
- The estimated policy satisfies [equation missing]

What is the intuition?
- The Bellman equation gives [equation missing]
- If [condition missing] and the training set were infinite, then Q-learning minimizes [equation missing], which is equivalent to minimizing [equation missing]

A-learning [Murphy, 2003] and [Robins, 2004]
- Estimate the A-function (advantages) using some approximator, as in Q-learning.
- Derive the estimated policy as an argument of the maximum of the estimated A-function.
- Allow different parameter vectors at different time points.
- Let us illustrate the algorithm with linear regression as the approximator and, of course, squared error as the appropriate loss function.

A-learning algorithm (inefficient version)
- For [equation missing]
- The estimated policy satisfies [equation missing]
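The per-transition Q-learning update and the ε-greedy exploration rule mentioned in the slides above can be sketched together in tabular form. The update equation was lost from the slide, so the standard rule Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)] is assumed. The corridor environment, the constants and all identifiers are hypothetical. A constant learning rate is used because the toy environment is deterministic, whereas the slides call for a decreasing rate in general.

```python
import random

N_STATES = 6        # hypothetical corridor: states 0..5, state 5 is the goal
ACTIONS = (0, 1)    # 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1   # assumed constants


def step(state, action):
    """Deterministic corridor dynamics; reward 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def epsilon_greedy(q, state, rng):
    """Explore with probability EPSILON, otherwise act greedily (random tie-break)."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(n_episodes=2000, max_steps=1000, seed=0):
    """Tabular Q-learning: one update after every state transition."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(n_episodes):
        state = 0
        for _step in range(max_steps):
            action = epsilon_greedy(q, state, rng)
            nxt, reward, done = step(state, action)
            # Off-policy target: best action value in the successor state.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

In this corridor the optimal value of moving right from state s is 0.9^(4 − s), so `q[0][1]` should approach 0.9^4 and the greedy policy should move right everywhere. No model of the transitions is learned at any point.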

Differences between Q-learning and A-learning
- Q-learning
  - At time t we model the main effects of the history (S_t, A_{t−1}) and the action A_t, and their interaction.
  - Our Y_{t−1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t−1}).
- A-learning
  - At time t we only model the effects of A_t and its interaction with (S_t, A_{t−1}).
  - Our Y_{t−1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t−1}).

Q-learning vs. A-learning
- The relative merits and demerits are not yet completely known.
- Q-learning has low variance but high bias.
- A-learning has high variance but low bias.
- Comparing Q-learning with A-learning involves a bias-variance trade-off.

POMDP: partially observable Markov decision process
- Rather than observing the state, we observe some function of the state.
- Ob: observable function, a random variable for each state.
- Problem: different states may look similar.
- The optimal strategy might need to consider the history.

Framework of POMDP
A POMDP is defined by the six-tuple <S, A, R, P, Ω, O>, where <S, A, R, P> defines the Markov decision model underlying the environment, Ω is the set of observations (the set of world states the system can perceive), and O: S × A → PD(Ω) is the observation function. When the system takes action a and moves to state s', the observation function O gives the probability distribution over the possible observations, written O(s', a, o).
Note: Ω may be a subset of S, or it may be unrelated to S.

POMDPs
- What if state information (from sensors) is noisy? That is mostly the case!
- MDP techniques are suboptimal!
- Two halls are not the same.

POMDPs: a solution strategy
- SE: belief state estimator (can be based on an HMM)
- π: MDP techniques

POMDP: the belief-state method
- Idea: given a history of actions and observed values, compute a posterior distribution over the state we are in (the belief state).
- The belief-state MDP
  - States: distributions over S (the states of the POMDP)
  - Actions: as in the POMDP
  - Transition: the posterior distribution (given the observation)
- Open problem: how to deal with the continuous distribution?

The learning process of the belief MDP
[Figure not extracted.]

Major methods for solving POMDPs
Category | Algorithm | Basic idea
Value-function learning | Memoryless policies | Apply standard reinforcement learning algorithms directly
Value-function learning | Simple memory-based approaches | Represent the current state by the last k observations
Value-function learning | UDM (Utile Distinction Memory) | Split states and build a finite-state-machine model
Value-function learning | NSM (Nearest Sequence Memory) | Store the state history and apply a distance measure
Value-function learning | USM (Utile Suffix Memory) | Combine the UDM and NSM approaches
Value-function learning | Recurrent-Q | Predict the state with a recurrent neural network
Policy search | Evolutionary algorithms | Search the policy space directly with a genetic algorithm
Policy search | Gradient ascent method | Search by gradient descent (ascent)

Function approximation in reinforcement learning
- RL + FA
- A subset of states provides value estimates V(s) as targets.
- Generalization of the value function to the entire state space.
- One operator is the TD operator; the other, M, is the function approximation operator.

Two iterative processes run in parallel
- Value function iteration
- Value function approximation
How to construct the M function? Using state clustering, interpolation, a decision tree or a neural network?

- Function approximator: V(s) = f(s, w)
- Update, gradient-descent Sarsa:
  $w \leftarrow w + \alpha \, [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)] \, \nabla_w f(s_t, a_t, w)$
  Here w is the weight vector, r_{t+1} + γ Q(s_{t+1}, a_{t+1}) is the target value, Q(s_t, a_t) is the estimated value, and ∇_w f is the standard gradient.
- Open problem: how to design a nonlinear FA system that converges with incremental instances?

Semi-MDP
- Discrete time, homogeneous discount
- Continuous time, discrete events, interval-dependent discount
- Discrete time, discrete events, interval-dependent discount
A discrete-time SMDP overlaid on an MDP can be analyzed at either level. This is one approach to temporal hierarchical RL.

The equations
[equations missing]

Multi-agent MDP
- Distributed RL
- Markov game
- Best response
[Figure: two RL agents interacting with one environment through actions, states and rewards.]

Three viewpoints
Viewpoint | Problem space | Main methods | Algorithm criterion
Cooperative multi-agent RL | Distributed, homogeneous, cooperative environments | Exchanging states; exchanging experience; exchanging policies; exchanging advice | Faster learning convergence
Equilibrium-based multi-agent RL | Homogeneous or heterogeneous, cooperative or competitive environments | Minimax-Q; Nash-Q; CE-Q; WoLF | Rationality and convergence
Best-response multi-agent RL | Heterogeneous, competitive environments | PHC; IGA; GIGA; GIGA-WoLF | Convergence and no-regret

Markov games
- In a system of n agents, define a discrete state set S (that is, the set of games G), the collection A of the agents' action sets A_i, the joint reward functions R_i: S × A_1 × ... × A_n → ℝ, and the state transition function P: S × A_1 × ... × A_n → PD(S).

Reinforcement learning based on equilibrium solutions
- Open problem: is a Nash equilibrium, or some other equilibrium, enough?
- The optimal policy in a single game is a Nash equilibrium.

Applications of RL
- Checkers: Samuel 59
- TD-Gammon: Tesauro 92
- World's best downpeak elevator dispatcher: Crites et al. 95
- Inventory management: Bertsekas et al. 95
  - 10-15% better than the industry standard
- Dynamic channel assignment: Singh & Bertsekas, Nie & Haykin 95
  - Outperforms the best heuristics in the literature
- Cart-pole: Michie & Chambers 68-, with bang-bang control
- Robotic manipulation: Grupen et al. 93-
- Path planning
- Robot docking: Lin 93
- Parking
- Football: Stone 98
- Tetris
- Multiagent RL: Tan 93, Sandholm & Crites 95, Sen 94-, Carmel & Markovitch
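The belief-state estimator from the POMDP slides (compute a posterior over states from the previous belief, the action taken and the observation received) can be sketched with a Bayes filter. The slides do not give the formula, so the usual form b'(s') ∝ O(s', a, o) Σ_s P(s, a, s') b(s) is assumed. The two-hall example, its probabilities and all identifiers are hypothetical.

```python
def belief_update(belief, action, observation, transition, observe):
    """Return the posterior belief after taking `action` and seeing `observation`.

    belief:      dict state -> probability
    transition:  transition[s][a] is a dict next_state -> probability
    observe:     observe[s2][a] is a dict observation -> probability, i.e. O(s', a, o)
    """
    unnormalized = {}
    for s2 in belief:
        # Prediction step: probability of landing in s2 before seeing anything.
        predicted = sum(belief[s] * transition[s][action].get(s2, 0.0)
                        for s in belief)
        # Correction step: weight by how likely the observation is in s2.
        unnormalized[s2] = observe[s2][action].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: p / total for s2, p in unnormalized.items()}


# Hypothetical example: two halls that look alike but are not the same.
transition = {
    "hall_A": {"stay": {"hall_A": 1.0}},
    "hall_B": {"stay": {"hall_B": 1.0}},
}
observe = {
    "hall_A": {"stay": {"door": 0.8, "wall": 0.2}},
    "hall_B": {"stay": {"door": 0.3, "wall": 0.7}},
}
prior = {"hall_A": 0.5, "hall_B": 0.5}
posterior = belief_update(prior, "stay", "door", transition, observe)
```

Starting from a uniform prior, seeing a door shifts the belief toward hall_A (0.4 / 0.55 = 8/11). The belief itself is a continuous quantity, which is the difficulty the open problem on the slide points at.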

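The gradient-descent Sarsa update from the function approximation slide can be sketched for the simplest case, a linear approximator Q(s, a) = w · φ(s, a), whose gradient with respect to w is the feature vector itself. The linear form, the feature vectors and all identifiers are assumptions chosen for illustration. The slide leaves the approximator f unspecified.

```python
def q_value(weights, features):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def sarsa_update(weights, features, reward, next_features,
                 alpha, gamma, terminal=False):
    """One gradient-descent Sarsa step; returns the new weight vector."""
    target = reward if terminal else reward + gamma * q_value(weights, next_features)
    td_error = target - q_value(weights, features)
    # For a linear approximator the gradient w.r.t. the weights is `features`.
    return [w + alpha * td_error * x for w, x in zip(weights, features)]


# Two hand-checkable updates with hypothetical one-hot features.
w = [0.0, 0.0]
w = sarsa_update(w, [1.0, 0.0], 1.0, [0.0, 1.0], alpha=0.5, gamma=0.9)
w = sarsa_update(w, [0.0, 1.0], 0.0, [1.0, 0.0], alpha=0.5, gamma=0.9)
```

The first step has target 1 and moves the first weight to 0.5. The second step has target 0.9 × 0.5 = 0.45 and moves the second weight to 0.225. With one-hot features this reduces to tabular Sarsa, while richer features let the update generalize across states.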