Reinforcement Learning: An Introduction

%%

{"created":"2022-12-04T15:42:22.580Z","text":"the two most important distinguishing features of reinforcement learning","updated":"2022-12-04T15:42:22.580Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","target":[{"source":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":45747,"end":45788},{"type":"TextQuoteSelector","exact":"trial-and-error search and delayed reward","prefix":"wards. These twocharacteristics—","suffix":"—are the two most importantdisti"}]}]}
%% %%PREFIX%%wards. These two characteristics—%%HIGHLIGHT%% ==trial-and-error search and delayed reward== %%POSTFIX%%—are the two most important disti %%LINK%%[[#^cq1btvdxbba|show annotation]] %%COMMENT%% the two most important distinguishing features of reinforcement learning %%TAGS%%

^cq1btvdxbba

%%

{"created":"2022-12-04T15:52:40.928Z","text":"the differences with the supervised learning and the unsupervised learning","updated":"2022-12-04T15:52:40.928Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","target":[{"source":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":47368,"end":49373},{"type":"TextQuoteSelector","exact":"Reinforcement learning is di↵erent from supervised learning, the kind of learning studiedin most current research in the field of machine learning. Supervised learning is learningfrom a training set of labeled examples provided by a knowledgable external supervisor.Each example is a description of a situation together with a specification—the label—ofthe correct action the system should take in that situation, which is often to identify acategory to which the situation belongs. The object of this kind of learning is for thesystem to extrapolate, or generalize, its responses so that it acts correctly in situationsnot present in the training set. This is an important kind of learning, but alone it is notadequate for learning from interaction. In interactive problems it is often impractical toobtain examples of desired behavior that are both correct and representative of all thesituations in which the agent has to act. In uncharted territory—where one would expectlearning to be most beneficial—an agent must be able to learn from its own experience.Reinforcement learning is also di↵erent from what machine learning researchers callunsupervised learning, which is typically about finding structure hidden in collections ofunlabeled data. 
The terms supervised learning and unsupervised learning would seemto exhaustively classify machine learning paradigms, but they do not. Although onemight be tempted to think of reinforcement learning as a kind of unsupervised learningbecause it does not rely on examples of correct behavior, reinforcement learning is tryingto maximize a reward signal instead of trying to find hidden structure. Uncoveringstructure in an agent’s experience can certainly be useful in reinforcement learning, but byitself does not address the reinforcement learning problem of maximizing a reward signal.We therefore consider reinforcement learning to be a third machine learning paradigm,alongside supervised learning and unsupervised learning and perhaps other paradigms","prefix":"a reinforcement learning method.","suffix":".1.1. Reinforcement Learning 3On"}]}]}
%% %%PREFIX%%a reinforcement learning method.%%HIGHLIGHT%% ==Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in the field of machine learning. Supervised learning is learning from a training set of labeled examples provided by a knowledgable external supervisor. Each example is a description of a situation together with a specification—the label—of the correct action the system should take in that situation, which is often to identify a category to which the situation belongs. The object of this kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly in situations not present in the training set. This is an important kind of learning, but alone it is not adequate for learning from interaction. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory—where one would expect learning to be most beneficial—an agent must be able to learn from its own experience. Reinforcement learning is also different from what machine learning researchers call unsupervised learning, which is typically about finding structure hidden in collections of unlabeled data. The terms supervised learning and unsupervised learning would seem to exhaustively classify machine learning paradigms, but they do not. Although one might be tempted to think of reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure. Uncovering structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning problem of maximizing a reward signal. We therefore consider reinforcement learning to be a third machine learning paradigm, alongside supervised learning and unsupervised learning and perhaps other paradigms== %%POSTFIX%%.1.1. Reinforcement Learning 3On %%LINK%%[[#^3a6mjdvi6vm|show annotation]] %%COMMENT%% the differences with the supervised learning and the unsupervised learning %%TAGS%%

^3a6mjdvi6vm

%%

{"created":"2022-12-04T15:55:06.705Z","text":"the difference between reinforcement learning and machine learning (maybe unsupervised learning)\nReinforcement learning tries to maximize a reward signal rather than trying to find hidden structure","updated":"2022-12-04T15:55:06.705Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","target":[{"source":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":48910,"end":49012},{"type":"TextQuoteSelector","exact":"reinforcement learning is tryingto maximize a reward signal instead of trying to find hidden structure","prefix":"n examples of correct behavior, ","suffix":". Uncoveringstructure in an agen"}]}]}
%% %%PREFIX%%n examples of correct behavior,%%HIGHLIGHT%% ==reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure== %%POSTFIX%%. Uncovering structure in an agen %%LINK%%[[#^qxiv7sf3cvn|show annotation]] %%COMMENT%% the difference between reinforcement learning and machine learning (maybe unsupervised learning). Reinforcement learning tries to maximize a reward signal rather than trying to find hidden structure %%TAGS%%

^qxiv7sf3cvn

%%

{"created":"2022-12-05T15:46:17.382Z","updated":"2022-12-05T15:46:17.382Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","target":[{"source":"vault:/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0Reinforce_Learning/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":53899,"end":54144},{"type":"TextQuoteSelector","exact":" Of all theforms of machine learning, reinforcement learning is the closest to the kind of learningthat humans and other animals do, and many of the core algorithms of reinforcementlearning were originally inspired by biological learning systems","prefix":"antial benefits going both ways.","suffix":". Reinforcement learninghas also"}]}]}
%% %%PREFIX%%antial benefits going both ways.%%HIGHLIGHT%% ==Of all the forms of machine learning, reinforcement learning is the closest to the kind of learning that humans and other animals do, and many of the core algorithms of reinforcement learning were originally inspired by biological learning systems== %%POSTFIX%%. Reinforcement learning has also %%LINK%%[[#^213poqire0h|show annotation]] %%COMMENT%%

%%TAGS%%

^213poqire0h

%%

{"created":"2022-12-12T15:19:35.760Z","updated":"2022-12-12T15:19:35.760Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/pdfs/RL/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/pdfs/RL/RLbook2020.pdf","target":[{"source":"vault:/pdfs/RL/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":59976,"end":60122},{"type":"TextQuoteSelector","exact":" four main subelements of areinforcement learning system: a policy, a reward signal, a value function, and, optionally,a model of the environment.","prefix":"he environment, one can identify","suffix":"A policy defines the learning ag"}]}]}
%% %%PREFIX%%he environment, one can identify%%HIGHLIGHT%% ==four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.== %%POSTFIX%%A policy defines the learning ag %%LINK%%[[#^nwwduwpdd5b|show annotation]] %%COMMENT%%

%%TAGS%%

^nwwduwpdd5b
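
To make the four subelements named in the highlight above concrete, here is a minimal Python sketch. It is not from the book: the corridor environment, the names `N_STATES`, `GOAL`, `ACTIONS`, `model`, `reward_signal`, `value`, and `policy`, and the epsilon-greedy rule are all assumptions chosen purely for illustration.

```python
import random

# Hypothetical toy problem: a 1-D corridor with states 0..4; state 4 is the goal.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (-1, +1)  # step left or step right


def model(state, action):
    """Model of the environment (the optional element): predicts the next state."""
    return min(max(state + action, 0), GOAL)


def reward_signal(next_state):
    """Reward signal: immediate feedback, here 1.0 only when the goal is reached."""
    return 1.0 if next_state == GOAL else 0.0


# Value function: a table of long-run reward estimates, one per state (all zero initially).
value = [0.0] * N_STATES


def policy(state, epsilon=0.1, rng=random):
    """Policy: maps a state to an action; mostly greedy with respect to `value`."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # occasional exploratory action
    # Otherwise pick the action whose predicted next state has the highest value estimate.
    return max(ACTIONS, key=lambda a: value[model(state, a)])
```

In this sketch the policy consults the model to look one step ahead, which is one possible way the optional model could be used; a model-free agent would keep the policy, reward signal, and value function and drop `model`.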

%%

{"created":"2022-12-12T15:19:48.463Z","updated":"2022-12-12T15:19:48.463Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/pdfs/RL/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/pdfs/RL/RLbook2020.pdf","target":[{"source":"vault:/pdfs/RL/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":60122,"end":60190},{"type":"TextQuoteSelector","exact":"A policy defines the learning agent’s way of behaving at a given tim","prefix":"ally,a model of the environment.","suffix":"e. Roughly speaking,a policy is "}]}]}
%% %%PREFIX%%ally, a model of the environment.%%HIGHLIGHT%% ==A policy defines the learning agent’s way of behaving at a given tim== %%POSTFIX%%e. Roughly speaking, a policy is %%LINK%%[[#^claybst3l74|show annotation]] %%COMMENT%%

%%TAGS%%

^claybst3l74

%%

{"created":"2022-12-12T15:19:53.115Z","text":"immediate reward","updated":"2022-12-12T15:19:53.115Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/pdfs/RL/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/pdfs/RL/RLbook2020.pdf","target":[{"source":"vault:/pdfs/RL/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":60763,"end":60832},{"type":"TextQuoteSelector","exact":"A reward signal defines the goal of a reinforcement learning problem.","prefix":"ngprobabilities for each action.","suffix":" On each timestep, the environme"}]}]}
%% %%PREFIX%%ng probabilities for each action.%%HIGHLIGHT%% ==A reward signal defines the goal of a reinforcement learning problem.== %%POSTFIX%%On each timestep, the environme %%LINK%%[[#^chkka94n6jj|show annotation]] %%COMMENT%% immediate reward %%TAGS%%

^chkka94n6jj

%%

{"created":"2022-12-12T16:02:32.574Z","text":"expected reward","updated":"2022-12-12T16:02:32.574Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/pdfs/RL/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/pdfs/RL/RLbook2020.pdf","target":[{"source":"vault:/pdfs/RL/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":61688,"end":61742},{"type":"TextQuoteSelector","exact":"a valuefunction specifies what is good in the long run","prefix":" is good in an immediate sense, ","suffix":". Roughly speaking, the value of"}]}]}
%% %%PREFIX%%is good in an immediate sense,%%HIGHLIGHT%% ==a value function specifies what is good in the long run== %%POSTFIX%%. Roughly speaking, the value of %%LINK%%[[#^gdz7h9l919|show annotation]] %%COMMENT%% expected reward %%TAGS%%

^gdz7h9l919
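
The contrast between immediate reward and long-run value in the two highlights above can be illustrated with a small calculation. This is a sketch under assumptions that do not appear in these notes: "long run" is measured here as a discounted sum of rewards, and the function name `discounted_return`, the discount factor `gamma = 0.9`, and the two reward sequences are hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where each later reward is weighted by one more factor of gamma."""
    g = 0.0
    # Work backwards so that each step computes r_t + gamma * (return from t+1 onward).
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Two hypothetical reward sequences starting from the same state:
path_a = [1.0, 0.0, 0.0]  # better immediate reward, nothing afterwards
path_b = [0.0, 0.0, 5.0]  # no immediate reward, larger delayed reward

return_a = discounted_return(path_a)
return_b = discounted_return(path_b)
```

Judged by the first reward alone, `path_a` looks better; judged by the accumulated return, `path_b` does. A value function would summarize the expected return of this kind for each state.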

%%

{"created":"2022-12-12T16:05:00.485Z","updated":"2022-12-12T16:05:00.485Z","document":{"title":"RLbook2020.pdf","link":[{"href":"urn:x-pdf:8084268f8a7b70a39b0c37683e337727"},{"href":"vault:/pdfs/RL/RLbook2020.pdf"}],"documentFingerprint":"8084268f8a7b70a39b0c37683e337727"},"uri":"vault:/pdfs/RL/RLbook2020.pdf","target":[{"source":"vault:/pdfs/RL/RLbook2020.pdf","selector":[{"type":"TextPositionSelector","start":62698,"end":62760},{"type":"TextQuoteSelector","exact":"the only purpose of estimating values is toachieve more reward","prefix":"s there could be no values, and ","suffix":". Nevertheless, it is values wit"}]}]}
%% %%PREFIX%%s there could be no values, and%%HIGHLIGHT%% ==the only purpose of estimating values is to achieve more reward== %%POSTFIX%%. Nevertheless, it is values wit %%LINK%%[[#^1ej7beapexh|show annotation]] %%COMMENT%%

%%TAGS%%

^1ej7beapexh