Mathematical Foundations of Reinforcement Learning (强化学习的数学原理, English edition), by Shiyu Zhao (赵世钰), Tsinghua University Press

Contents

Overview of this Book
Chapter 1 Basic Concepts
1.1 A grid world example
1.2 State and action
1.3 State transition
1.4 Policy
1.5 Reward
1.6 Trajectories, returns, and episodes
1.7 Markov decision processes
1.8 Summary
1.9 Q&A
Chapter 2 State Values and the Bellman Equation
2.1 Motivating example 1: Why are returns important?
2.2 Motivating example 2: How to calculate returns?
2.3 State values
2.4 The Bellman equation
2.5 Examples for illustrating the Bellman equation
2.6 Matrix-vector form of the Bellman equation
2.7 Solving state values from the Bellman equation
2.7.1 Closed-form solution
2.7.2 Iterative solution
2.7.3 Illustrative examples
2.8 From state value to action value
2.8.1 Illustrative examples
2.8.2 The Bellman equation in terms of action values
2.9 Summary
2.10 Q&A
Chapter 3 Optimal State Values and the Bellman Optimality Equation
3.1 Motivating example: How to improve policies?
3.2 Optimal state values and optimal policies
3.3 The Bellman optimality equation
3.3.1 Maximization of the right-hand side of the BOE
3.3.2 Matrix-vector form of the BOE
3.3.3 Contraction mapping theorem
3.3.4 Contraction property of the right-hand side of the BOE
3.4 Solving an optimal policy from the BOE
3.5 Factors that influence optimal policies
3.6 Summary
3.7 Q&A
Chapter 4 Value Iteration and Policy Iteration
4.1 Value iteration
4.1.1 Elementwise form and implementation
4.1.2 Illustrative examples
4.2 Policy iteration
4.2.1 Algorithm analysis
4.2.2 Elementwise form and implementation
4.2.3 Illustrative examples
4.3 Truncated policy iteration
4.3.1 Comparing value iteration and policy iteration
4.3.2 Truncated policy iteration algorithm
4.4 Summary
4.5 Q&A
Chapter 5 Monte Carlo Methods
5.1 Motivating example: Mean estimation
5.2 MC Basic: The simplest MC-based algorithm
5.2.1 Converting policy iteration to be model-free
5.2.2 The MC Basic algorithm
5.2.3 Illustrative examples
5.3 MC Exploring Starts
5.3.1 Utilizing samples more efficiently
5.3.2 Updating policies more efficiently
5.3.3 Algorithm description
5.4 MC ε-Greedy: Learning without exploring starts
5.4.1 ε-greedy policies
5.4.2 Algorithm description
5.4.3 Illustrative examples
5.5 Exploration and exploitation of ε-greedy policies
5.6 Summary
5.7 Q&A
Chapter 6 Stochastic Approximation
6.1 Motivating example: Mean estimation
6.2 Robbins-Monro algorithm
6.2.1 Convergence properties
6.2.2 Application to mean estimation
6.3 Dvoretzky's convergence theorem
6.3.1 Proof of Dvoretzky's theorem
6.3.2 Application to mean estimation
6.3.3 Application to the Robbins-Monro theorem
6.3.4 An extension of Dvoretzky's theorem
6.4 Stochastic gradient descent
6.4.1 Application to mean estimation
6.4.2 Convergence pattern of SGD
6.4.3 A deterministic formulation of SGD
6.4.4 BGD, SGD, and mini-batch GD
6.4.5 Convergence of SGD
6.5 Summary
6.6 Q&A
Chapter 7 Temporal-Difference Methods
7.1 TD learning of state values
7.1.1 Algorithm description
7.1.2 Property analysis
7.1.3 Convergence analysis
7.2 TD learning of action values: Sarsa
7.2.1 Algorithm description
7.2.2 Optimal policy learning via Sarsa
7.3 TD learning of action values: n-step Sarsa
7.4 TD learning of optimal action values: Q-learning
7.4.1 Algorithm description
7.4.2 Off-policy vs. on-policy
7.4.3 Implementation
7.4.4 Illustrative examples
7.5 A unified viewpoint
7.6 Summary
7.7 Q&A
Chapter 8 Value Function Approximation
8.1 Value representation: From table to function
8.2 TD learning of state values with function approximation
8.2.1 Objective function
8.2.2 Optimization algorithms
8.2.3 Selection of function approximators
8.2.4 Illustrative examples
8.2.5 Theoretical analysis
8.3 TD learning of action values with function approximation
8.3.1 Sarsa with function approximation
8.3.2 Q-learning with function approximation
8.4 Deep Q-learning
8.4.1 Algorithm description
8.4.2 Illustrative examples
8.5 Summary
8.6 Q&A
Chapter 9 Policy Gradient Methods
9.1 Policy representation: From table to function
9.2 Metrics for defining optimal policies
9.3 Gradients of the metrics
9.3.1 Derivation of the gradients in the discounted case
9.3.2 Derivation of the gradients in the undiscounted case
9.4 Monte Carlo policy gradient (REINFORCE)
9.5 Summary
9.6 Q&A
Chapter 10 Actor-Critic Methods
10.1 The simplest actor-critic algorithm (QAC)
10.2 Advantage actor-critic (A2C)
10.2.1 Baseline invariance
10.2.2 Algorithm description
10.3 Off-policy actor-critic
10.3.1 Importance sampling
10.3.2 The off-policy policy gradient theorem
10.3.3 Algorithm description
10.4 Deterministic actor-critic
10.4.1 The deterministic policy gradient theorem
10.4.2 Algorithm description
10.5 Summary
10.6 Q&A
Appendix A Preliminaries for Probability Theory
Appendix B Measure-Theoretic Probability Theory
Appendix C Convergence of Sequences
C.1 Convergence of deterministic sequences
C.2 Convergence of stochastic sequences
Appendix D Preliminaries for Gradient Descent
Bibliography
Symbols
Index

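Title	Mathematical Foundations of Reinforcement Learning (强化学习的数学原理, English edition)
Category	Science and Technology - Natural Science - Popular Science
Author	Shiyu Zhao (赵世钰)
Publisher	Tsinghua University Press (清华大学出版社)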
Synopsis	This book starts from the most basic concepts of reinforcement learning and introduces the fundamental analytical tools, including the Bellman equation and the Bellman optimality equation. It then extends to model-based and model-free reinforcement learning algorithms, and finally to reinforcement learning methods based on function approximation. The book emphasizes introducing concepts, analyzing problems, and analyzing algorithms from a mathematical perspective; it does not emphasize the programming implementation of the algorithms. Readers need no prior background in reinforcement learning, only some knowledge of probability theory and linear algebra. For readers who already have a foundation in reinforcement learning, the book can deepen their understanding of certain issues and offer new perspectives. It is intended for undergraduates, graduate students, researchers, and practitioners in companies or research institutes who are interested in reinforcement learning.