// Copyright (C) 2003
// Gerhard Neumann (gerhard@igi.tu-graz.ac.at)

//                
// This file is part of RL Toolbox.
// http://www.igi.tugraz.at/ril_toolbox
//
// All rights reserved.
// 
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
// are met:
// 1. Redistributions of source code must retain the above copyright
//    notice, this list of conditions and the following disclaimer.
// 2. Redistributions in binary form must reproduce the above copyright
//    notice, this list of conditions and the following disclaimer in the
//    documentation and/or other materials provided with the distribution.
// 3. The name of the author may not be used to endorse or promote products
//    derived from this software without specific prior written permission.
// 
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
// IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
// OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
// IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
// INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
// NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
// THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

#ifndef CABSTRACTRILEARNER_H
#define CABSTRACTRILEARNER_H

#include "cqfunction.h"
#include "cagentlistener.h"
#include "cagentcontroller.h"
#include "caction.h"
#include "cqetraces.h"
#include "cresiduals.h"
#include "ril_debug.h"
#include "cerrorlistener.h"

/// Class for Temporal Difference Learning
/**Temporal Difference (TD) Q-Value learners are the most common model-free reinforcement learning algorithms. They update the Q-Values according to the difference between the current Q-Value and the target value
Q(s_t, a_t) = R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) for each step sample. The TD update for the Q-Values is therefore
Q_new(s_t, a_t) = (1 - alpha) * Q_old(s_t, a_t) + alpha * (R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1})), which can be rewritten as Q_old(s_t, a_t) + alpha * (R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)),
where R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) is the temporal difference. For the semi-Markov case the temporal difference
is R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), where N is the duration of the action. This temporal difference update is usually applied to states
from the past as well, using ETraces. This method is called TD-Lambda.
<p>
In the RL Toolbox, TD-Learners are represented by the class CTDLearner, which provides an implementation of the TD-Lambda algorithm. The class maintains a Q-Function, an ETraces object, a Reward Function and a Policy serving as estimation policy, which is needed for the calculation of a_{t+1}.
The Q-Function, the Reward Function and the Policy have to be provided by the user. The ETraces object is usually initialized with the standard ETraces object of the Q-Function, but can also be specified explicitly.
<p>
The learnStep function updates the Q-Function according to the step sample; it is called by the nextStep event.
First the last estimated action (a_{t+1}) is compared to the action that was really executed. If these two actions are not equal,
the ETraces have to be reset, because the agent did not follow the policy that is being learned. If you don't want to reset the ETraces in that case, set the parameter "ResetETracesOnWrongEstimate" to false (0.0). If the two actions are equal, the ETraces get multiplied by lambda * gamma.
After that, the ETrace of the current state-action pair is added to the ETraces object, then the next estimated action is calculated by the given policy and stored. Now the temporal difference error can be calculated as R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), or R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) for multi-step actions. Given the temporal difference error, all states in the ETraces are updated by the updateQFunction method of the Q-ETraces object. Before the update, the temporal difference error gets multiplied with the learning rate (Parameter: "QLearningRate"). A minimal tabular sketch of this update loop is given below, after the class declaration.
<p>
The getTemporalDifference function calculates the old Q-Value and the new Q-Value and then calls the getResidual function, which does the actual temporal difference error computation. 
<p>
For hierarchical MDPs, intermediate steps get a special treatment in the TD algorithm. Since the intermediate steps aren't really members of the episode, they need special treatment for the ETraces.
The state of the intermediate step is added to the ETraces object as usual, but the multiplication of all other ETraces is cancelled and the Q-Function isn't updated with the whole ETraces object; only the Q-Value of the intermediate state is updated.
This is done because the intermediate step isn't directly reachable from the past states, and updating all intermediate steps via the ETraces would falsify the Q-Values, since the same step would get updated several times.
<p>
CTDLearner has the following Parameters:
- inherits all Parameters from the Q-Function
- inherits all Parameters from the ETraces
- "QLearningRate", 0.2 : learning rate of the algorithm
- "DiscountFactor", 0.95 : discount factor of the learning problem
- "ResetETracesOnWrongEstimate", 1.0 : reset etraces when the estimated action wasn't the rlt_real executed.

@see CQLearner
@see CSarsaLearner
*/

class CTDLearner : public CSemiMDPRewardListener, public CErrorSender
{
  protected:

/// Use external eTraces
	bool externETraces;

/// Estimation policy - the policy which is learned
	CAgentController *estimationPolicy;

/// The last action estimated by the policy
	CAction *lastEstimatedAction;

	CAbstractQFunction *qfunction;

	CAbstractQETraces *etraces;

	CActionDataSet *actionDataSet;

	/// Updates the Q-Function and manages the ETraces.
/**The learnStep function updates the Q-Function according to the step sample; it is called by the nextStep event.
First the last estimated action (a_{t+1}) is compared to the action that was really executed. If these two actions are not equal,
the ETraces have to be reset, because the agent did not follow the policy that is being learned; using the ETraces of older states would falsify the Q-Values. If the two actions are equal, the ETraces get multiplied by lambda * gamma.
After that, the ETrace of the current state-action pair is added to the ETraces object, and the next estimated action is calculated by the given policy. Now the temporal difference can be calculated as R(s_t, a_t, s_{t+1}) + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), or
R(s_t, a_t, s_{t+1}) + gamma^N * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) for multi-step actions. Given the temporal difference, all states in the ETraces are updated by
the updateQFunction method of the Q-ETraces object.
*/
	virtual void learnStep(CStateCollection *oldState, CAction *action, rlt_real reward, CStateCollection *nextState);

	/// calculates the temporal difference
	virtual rlt_real getTemporalDifference(CStateCollection *oldState, CAction *action, rlt_real reward, CStateCollection *nextState);

	/// returns the temporal difference error residual
	virtual rlt_real getResidual(rlt_real oldQ, rlt_real reward, int duration, rlt_real newQ);

	/// adds the current state to the etraces
	virtual void addETraces(CStateCollection *oldState, CStateCollection *newState, CAction *action);

public:
	/// Creates a TD Learner with the given abstract Q-Function and Q-ETraces
    CTDLearner(CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAbstractQETraces *etraces, CAgentController *estimationPolicy);		
    /// Creates a TD Learner with the given composed Q-Function and a new composed Q-ETraces object.
	/**
	The ETraces get initialised with the standard V-ETraces of the Q-Function's V-Functions. If you want to access the V-ETraces, you have to cast the result of getQETraces() from (CAbstractQETraces *) to (CQETraces *).
	*/
	CTDLearner(CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CAgentController *estimationPolicy);		
		
	virtual ~CTDLearner();
 		
	virtual void loadValues(char *filename);
	virtual void saveValues(char *filename);

 	virtual void loadValues(FILE *stream);
	virtual void saveValues(FILE *stream);

/// Calls the update function learnStep
	virtual void nextStep(CStateCollection *oldState, CAction *action, rlt_real reward, CStateCollection *nextState);
/// Updates the Q-Function for an intermediate step
/**Since the intermediate steps aren't really members of the episode, they need special treatment for the ETraces.
The state of the intermediate step is added to the ETraces object as usual, but the multiplication of all other ETraces is cancelled and the Q-Function isn't updated with the whole ETraces object; only the Q-Value of the intermediate state is updated.
This is done because the intermediate step isn't directly reachable from the past states, and updating all intermediate steps via the ETraces would falsify the Q-Values, since the same step would get updated several times.
*/
	virtual void intermediateStep(CStateCollection *oldState, CAction *action, rlt_real reward, CStateCollection *nextState);

/// Resets the Etraces
	virtual void newEpisode();

/// Sets the gamma value of the Q-Function (discount factor)
//	void setGamma(rlt_real gamma);
/// Sets the learning rate
 	void setAlpha(rlt_real alpha);
/// Sets the lambda parameter of the etraces.
	void setLambda(rlt_real lambda);

	CAgentController* getEstimationPolicy();
	void setEstimationPolicy(CAgentController * estimationPolicy);

	CAbstractQFunction* getQFunction();

	CAbstractQETraces *getETraces();
};
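
// Illustrative sketch, not part of the RL Toolbox: the tabular TD-Lambda update
// described in the CTDLearner documentation, written out for a small discrete
// Q-Table. All names below (TabularTDLambdaSketch and its members) are hypothetical;
// alpha, gamma and lambda correspond to "QLearningRate", "DiscountFactor" and the
// ETraces' lambda parameter.
#include <vector>

class TabularTDLambdaSketch
{
	std::vector<double> Q;  // Q-Values, indexed by state * numActions + action
	std::vector<double> e;  // eligibility traces (ETraces), same layout
	int numActions;
	double alpha;           // learning rate ("QLearningRate")
	double gamma;           // discount factor ("DiscountFactor")
	double lambda;          // ETraces decay factor

public:
	TabularTDLambdaSketch(int numStates, int numActions, double alpha, double gamma, double lambda)
		: Q(numStates * numActions, 0.0), e(numStates * numActions, 0.0),
		  numActions(numActions), alpha(alpha), gamma(gamma), lambda(lambda) {}

	// One TD-Lambda step for the sample (s, a, r, s', a'), where aNext is the action
	// estimated by the policy. followedPolicy plays the role of the comparison with the
	// last estimated action: if the executed action differed, the traces are reset
	// (cf. "ResetETracesOnWrongEstimate"); otherwise they are decayed by gamma * lambda.
	void learnStep(int s, int a, double r, int sNext, int aNext, bool followedPolicy)
	{
		if (!followedPolicy)
			e.assign(e.size(), 0.0);
		else
			for (size_t i = 0; i < e.size(); i++)
				e[i] *= gamma * lambda;

		// add the trace of the current state-action pair
		e[s * numActions + a] += 1.0;

		// temporal difference error: R + gamma * Q(s', a') - Q(s, a)
		double td = r + gamma * Q[sNext * numActions + aNext] - Q[s * numActions + a];

		// update every state-action pair in the traces (analogue of updateQFunction)
		for (size_t i = 0; i < Q.size(); i++)
			Q[i] += alpha * td * e[i];
	}

	// Analogue of intermediateStep: only the Q-Value of the intermediate pair is
	// updated; the traces of past states are left untouched and not decayed.
	void intermediateStep(int s, int a, double r, int sNext, int aNext)
	{
		double td = r + gamma * Q[sNext * numActions + aNext] - Q[s * numActions + a];
		Q[s * numActions + a] += alpha * td;
	}

	// Analogue of newEpisode: reset the ETraces.
	void newEpisode() { e.assign(e.size(), 0.0); }
};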

/// Class for Q-Learning
/**Q-Learning always chooses the best action for the state s_{t+1}, which doesn't have to be the action
actually executed in s_{t+1}, since exploration policies might choose another action. Q-Learning is therefore off-policy learning: it doesn't learn the values of the agent's policy, but those of the optimal policy.
<p>
The class is just a normal TD-Learner, initializing the estimation policy with a CQGreedyPolicy object.
*/

class CQLearner : public CTDLearner
{
public:
	CQLearner(CRewardFunction *rewardFunction, CAbstractQFunction *qfunction);
	~CQLearner();
};
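
// Illustrative sketch, not toolbox code: the only thing Q-Learning changes in the TD
// update above is the choice of a_{t+1}. The estimation policy is greedy, so the TD
// target uses max_a Q(s_{t+1}, a) regardless of the action the exploration policy
// really executes (off-policy). The qTable layout and names are hypothetical.
inline double qLearningTargetSketch(const double *qTable, int sNext, int numActions,
                                    double reward, double gamma)
{
	double maxQ = qTable[sNext * numActions];
	for (int a = 1; a < numActions; a++)
		if (qTable[sNext * numActions + a] > maxQ)
			maxQ = qTable[sNext * numActions + a];
	return reward + gamma * maxQ;  // greedy (off-policy) TD target
}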

/// Class for Sarsa Learning
/**
The other possibility for choosing the action a_{t+1} is to always choose the action which is really executed by the agent. This method is called SARSA learning (the update uses a
(S)tate-(A)ction-(R)eward-(S)tate-(A)ction tuple). It learns the values of the agent's policy directly. Which method (Q or SARSA learning) works better depends on the learning
problem; generally SARSA learning is safer if there are states with a high negative reward, since SARSA learning takes the exploration policy of the agent into account.
<p>
Since the SARSA algorithm needs to know what the agent will do in the next step, it gets a pointer to the agent. The agent serves as a deterministic controller, storing the action coming from its controller. The learner can use the agent's getNextAction method to get the next estimated action. Because the estimation policy is the policy of the agent, the ETraces of the SARSA learner only have to be reset when a new episode begins. This can lead to better performance than the Q-Learning algorithm.
<p>
The SARSA learner expects a deterministic controller as estimation policy, which is usually the agent or a hierarchical MDP.
*/
class CSarsaLearner : public CTDLearner
{
public:
	CSarsaLearner(CRewardFunction *rewardFunction, CAbstractQFunction *qfunction, CDeterministicController *agent);
	~CSarsaLearner();
};
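
// Illustrative sketch, not toolbox code: SARSA plugs in the action a_{t+1} that the
// agent will really execute (passed here as aNext; in the toolbox it is obtained via
// the deterministic controller's getNextAction), so the TD target is on-policy.
// Names are hypothetical.
inline double sarsaTargetSketch(const double *qTable, int sNext, int aNext,
                                int numActions, double reward, double gamma)
{
	return reward + gamma * qTable[sNext * numActions + aNext];  // on-policy TD target
}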


class CTDGradientLearner : public CTDLearner
{
protected:
	CResidualFunction *residual;
	CResidualGradientFunction *residualGradient;
	CGradientQFunction *gradientQFunction;
	CGradientQETraces *gradientQETraces;

	CFeatureList *oldGradient;
	CFeatureList *newGradient;
	CFeatureList *residualGradientFeatures;

	virtual rlt_real getResidual(rlt_real oldQ, rlt_real reward, int duration, rlt_real newQ);
	virtual void addETraces(CStateCollection *oldState, CStateCollection *newState, CAction *action);

public:
	CTDGradientLearner(CRewardFunction *rewardFunction, CGradientQFunction *qfunction, CAgentController *agent, CResidualFunction *residual, CResidualGradientFunction *residualGradient);

	~CTDGradientLearner();
};

class CTDResidualLearner : public CTDGradientLearner
{
protected:
	
	CGradientQETraces *residualGradientTraces;
	CGradientQETraces *directGradientTraces;

	CGradientQETraces *residualETraces;

	CAbstractBetaCalculator *betaCalculator;

	virtual void learnStep(CStateCollection *oldState, CAction *action, rlt_real reward, CStateCollection *nextState);

public:
	CTDResidualLearner(CRewardFunction *rewardFunction, CGradientQFunction *qfunction, CAgentController *agent, CResidualFunction *residual, CResidualGradientFunction *residualGradient, CAbstractBetaCalculator *betaCalc);

	~CTDResidualLearner();

	void newEpisode();

	virtual void addETraces(CStateCollection *oldState, CStateCollection *newState, CAction *action, rlt_real td);

	CGradientQETraces *getResidualETraces() {return residualETraces;};
};

#endif

91论坛在线播放| 91麻豆精品国产91久久久久 | www.欧美日韩| 制服丝袜国产精品| 一区在线中文字幕| 国产在线不卡视频| 欧美剧在线免费观看网站| 日韩一区在线免费观看| 经典一区二区三区| 欧美麻豆精品久久久久久| 综合久久久久久| 国产乱人伦偷精品视频不卡| 在线不卡免费av| 中文字幕一区二区三区视频| 国产一区二区不卡老阿姨| 911国产精品| 一区二区三区在线播| 懂色av一区二区三区免费看| 日韩免费电影一区| 五月天久久比比资源色| 在线区一区二视频| 亚洲欧美另类久久久精品| 成人午夜视频在线| 国产欧美视频在线观看| 国产精品伊人色| 欧美精品一区二区蜜臀亚洲| 麻豆成人综合网| 欧美一区二区播放| 欧美a级一区二区| 欧美精品777| 日韩电影一区二区三区| 在线播放日韩导航| 日本一不卡视频| 欧美变态tickling挠脚心| 麻豆国产91在线播放| 精品剧情v国产在线观看在线| 麻豆精品视频在线观看免费| 欧美一区二区精品在线| 毛片av中文字幕一区二区| 欧美一区二区在线免费观看| 美国精品在线观看| 久久免费美女视频| 波多野洁衣一区| 亚洲免费av高清| 欧美亚洲免费在线一区| 亚洲3atv精品一区二区三区| 欧美剧在线免费观看网站| 蜜臀99久久精品久久久久久软件| 日韩视频中午一区| 国产伦精一区二区三区| 国产欧美日韩另类一区| 91网站黄www| 日韩精品一卡二卡三卡四卡无卡| 日韩西西人体444www| 麻豆久久久久久| 中文字幕免费一区| 色播五月激情综合网| 日韩高清在线观看| 亚洲四区在线观看| 在线观看日韩国产| 日产国产高清一区二区三区| 精品国产91洋老外米糕| 99精品黄色片免费大全| 亚洲午夜久久久久久久久电影院| 日韩情涩欧美日韩视频| 懂色中文一区二区在线播放| 一区二区三区产品免费精品久久75| 欧美日韩高清一区| 成人网在线免费视频| 亚洲成av人**亚洲成av**| 亚洲精品一区二区三区99| 色一情一乱一乱一91av| 久久aⅴ国产欧美74aaa| 亚洲精品久久嫩草网站秘色| 欧美电影免费观看完整版| 色又黄又爽网站www久久| 青青草国产精品亚洲专区无| 亚洲欧洲性图库| 日韩欧美国产精品一区| 91国偷自产一区二区开放时间 | 日韩精品在线看片z| 91香蕉视频污| 国产一区二三区好的| 一区二区三区中文在线观看| 久久一区二区三区国产精品| 欧美三级日韩三级| 99国产一区二区三精品乱码| 看国产成人h片视频| 亚洲国产美女搞黄色| 国产精品电影一区二区三区| 日韩一区二区在线看片| 色婷婷精品久久二区二区蜜臂av | www成人在线观看| 欧美日韩免费观看一区三区| 91免费观看视频在线| 国产69精品久久久久毛片 | 色综合视频一区二区三区高清| 免费在线视频一区| 亚洲国产一区二区三区| 中文字幕在线不卡一区| 欧美不卡激情三级在线观看| 欧美顶级少妇做爰| 欧美日韩国产一二三| 欧美性色综合网| 一本色道a无线码一区v| 91在线视频免费91| 丁香天五香天堂综合| 国产精品自产自拍| 国产盗摄精品一区二区三区在线| 日韩黄色片在线观看| 日韩精品一二三四| 舔着乳尖日韩一区| 亚洲电影第三页| 午夜视频在线观看一区| 亚洲成av人片在线观看| 午夜精品久久久久久久99水蜜桃 | 成人午夜视频福利| k8久久久一区二区三区| 国产精品一二二区| 国产不卡在线视频| 暴力调教一区二区三区| av亚洲精华国产精华精| 色综合久久88色综合天天免费| 91免费观看在线| 欧美日韩高清影院| 91精品黄色片免费大全| 欧美va亚洲va| 久久久久9999亚洲精品| 国产精品久久久久精k8| 亚洲精品一二三| 日韩二区三区四区| 久久99国产精品久久99果冻传媒| 国内精品免费在线观看| 成人一区二区在线观看| 在线免费一区三区| 欧美挠脚心视频网站| 欧美一区二区三区白人| 日韩女同互慰一区二区| 国产女人18水真多18精品一级做| 日韩一区欧美一区| 午夜精品一区在线观看| 国产美女精品一区二区三区| 91麻豆国产福利在线观看| 欧美日韩在线播放| 国产精品久久精品日日| 一区二区日韩av| 老司机午夜精品| av在线播放不卡| 3d动漫精品啪啪一区二区竹菊| 精品国精品国产| 亚洲精品欧美激情| 免费高清视频精品| caoporn国产精品| 欧美一级理论片| 国产欧美一区二区三区网站| 亚洲国产日韩a在线播放| 国产一区999| 欧美午夜片在线观看| 久久久精品天堂| 天天操天天干天天综合网| 国产成人自拍高清视频在线免费播放| 色先锋资源久久综合| 久久久久久久久久久久久夜| 亚洲一区二区三区四区在线观看 | 欧美一级在线观看| 亚洲欧洲一区二区在线播放| 蜜桃av一区二区| 欧美午夜精品一区二区三区| 国产性色一区二区| 日本不卡视频一二三区| 色一区在线观看| 国产精品乱码一区二区三区软件| 视频一区中文字幕国产| 一本大道久久a久久精二百 | 国产91高潮流白浆在线麻豆| 538在线一区二区精品国产| 亚洲视频小说图片| 国产综合色在线视频区| 日韩三级视频在线观看| 亚洲国产精品影院| 99精品欧美一区二区蜜桃免费| 久久综合资源网| 奇米影视在线99精品| 在线观看av不卡| 最新中文字幕一区二区三区| 国产xxx精品视频大全| 欧美一区二区日韩一区二区| 国产在线精品一区二区夜色| 欧美视频一二三区| 亚洲精品视频在线| 91视频在线观看| 亚洲欧美日韩久久| 夫妻av一区二区| 国产午夜精品一区二区三区视频| 韩国女主播成人在线| 精品日韩一区二区三区 | 国产91丝袜在线观看| 国产色爱av资源综合区| 国产精品18久久久久久久久| 精品国产亚洲一区二区三区在线观看 |