[Reinforcement Learning] CUHK RL Course Assignment Walkthrough 01_2

Course Resources

  • Course homepage: https://cuhkrlcourse.github.io/
  • Lecture videos: https://space.bilibili.com/511221970/channel/seriesdetail?sid=764099 [Bilibili]
  • Related material: https://datawhalechina.github.io/easy-rl/#/ [EasyRL]
  • Reinforcement Learning: An Introduction: https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
  • GitHub (assignment repository): https://github.com/cuhkrlcourse/ierg5350-assignment-2021
  • Gitee (my solutions): https://gitee.com/cstern-liao/cuhk_rl_assignment

2 Model-Based Tabular Methods

Model-based vs. Model-free

Agents are divided into model-based and model-free according to whether they model the real environment. A model-based agent builds a virtual model of the world: using the transition function $P(s_{t+1} \mid s_t, a_t)$ and the reward function $R(s_t, a_t)$ it can predict which state it will move to after taking a given action in a given state and what reward it will receive, so it can learn a policy or a value function to maximize reward directly from the model. For most real-world problems, however, we cannot observe every element of the environment; the transition and reward functions are simply not available to us, and **this is where model-free learning is needed.** A model-free agent does not model the environment at all: it can only execute actions in the real environment under some policy, observe the resulting rewards and state transitions, and use this feedback to update its policy, iterating until it learns the optimal one.

This assignment uses the model-based tabular approach.
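
To make "model-based" concrete: the FrozenLake environment used throughout the assignment exposes its full dynamics as a lookup table, so $P(s_{t+1} \mid s_t, a_t)$ and $R(s_t, a_t)$ can be read directly rather than estimated. The following is a minimal sketch of my own (not part of the assignment); depending on your gym version the table is reached via env.unwrapped.P, or via env.env.P as in the class below.

import gym

env = gym.make('FrozenLake8x8-v1')
state, action = 0, 2  # FrozenLake actions: 0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP

# P[state][action] is a list of (prob, next_state, reward, done) tuples,
# i.e. the transition and reward functions in tabular form.
for prob, next_state, reward, done in env.unwrapped.P[state][action]:
    print("P = {:.3f}, next state = {}, reward = {}, done = {}".format(
        prob, next_state, reward, done))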

In Section 2, the assignment provides the definition of the parent class TabularRLTrainerAbstract:

# Run this cell without modification
# (this cell assumes that gym has been imported and that the helpers
#  print_table / evaluate / test_random_policy are defined in earlier notebook cells)

class TabularRLTrainerAbstract:
    """This is the abstract class for tabular RL trainer. We will inherent the specify 
    algorithm's trainer from this abstract class, so that we can reuse the codes like
    getting the dynamic of the environment (self._get_transitions()) or rendering the
    learned policy (self.render())."""
    
    def __init__(self, env_name='FrozenLake8x8-v1', model_based=True):
        self.env_name = env_name
        self.env = gym.make(self.env_name)
        self.action_dim = self.env.action_space.n
        self.obs_dim = self.env.observation_space.n
        
        self.model_based = model_based

    def _get_transitions(self, state, act):
        """Query the environment to get the transition probability,
        reward, the next state, and done given a pair of state and action.
        We implement this function for you. But you need to know the 
        return format of this function.
        """
        self._check_env_name()
        assert self.model_based, "You should not use _get_transitions in " \
            "model-free algorithm!"
        
        # call the internal attribute of the environment.
        # `transitions` is a list containing all possible next states and the
        # probability, reward, and termination indicator corresponding to each
        transitions = self.env.env.P[state][act]

        # Given a certain state and action pair, it is possible
        # to find there exist multiple transitions, since the 
        # environment is not deterministic.
        # You need to know the return format of this function: a list of dicts
        ret = []
        for prob, next_state, reward, done in transitions:
            ret.append({
                "prob": prob,
                "next_state": next_state,
                "reward": reward,
                "done": done
            })
        return ret
    
    def _check_env_name(self):
        assert self.env_name.startswith('FrozenLake')

    def print_table(self):
        """print beautiful table, only work for FrozenLake8X8-v0 env. We 
        write this function for you."""
        self._check_env_name()
        print_table(self.table)

    def train(self):
        """Conduct one iteration of learning."""
        raise NotImplementedError("You need to override the "
                                  "Trainer.train() function.")

    def evaluate(self):
        """Use the function you write to evaluate current policy.
        Return the mean episode reward of 1000 episodes when seed=0."""
        result = evaluate(self.policy, 1000, env_name=self.env_name)
        return result

    def render(self):
        """Reuse your evaluate function, render current policy 
        for one episode when seed=0"""
        evaluate(self.policy, 1, render=True, env_name=self.env_name)

TabularRLTrainerAbstract is an abstract class: its train() method simply raises NotImplementedError and must be overridden. In Sections 2.1 and 2.2 we will inherit from it and override train() to implement policy iteration and value iteration respectively. Pay particular attention to the _get_transitions(self, state, act) method, whose return format (a list of dicts with keys prob, next_state, reward and done) both algorithms rely on.

2.1 Policy Iteration

First, a quick recap of the policy iteration algorithm:

  1. Policy evaluation: given the environment dynamics, update the value function under the current policy until it converges. This first step is an inner loop that exits when the value function differs very little from the previous sweep (i.e. it has converged):

    $v_{k+1}(s) = \mathbb{E}_{s'}\left[R(s,a) + \gamma\, v_k(s')\right]$

    where $a$ is the action given by the policy of the current iteration, $s'$ is the next state, $R$ is the reward function, and $v_k(s')$ is the value of the next state from the previous sweep.

  2. Policy improvement: find the policy that maximizes the value function obtained in this round, i.e. act greedily:

    $a = \arg\max_a \mathbb{E}_{s'}\left[R(s,a) + \gamma\, v_k(s')\right]$

  3. If the improved policy is identical to that of the previous round, stop; otherwise go back to step 1.

In summary, policy iteration has an outer loop (policy improvement) and an inner loop (the convergent value update of step 1); the condensed sketch below shows both.
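
Before moving to the class-based implementation the assignment asks for, here is a condensed, self-contained sketch of the whole algorithm (my own illustration, not the assignment's code). It reads the tabular model via env.unwrapped.P, as in gym's toy-text environments, and the function name is mine.

import gym
import numpy as np


def policy_iteration_sketch(env, gamma=1.0, eps=1e-10):
    """Plain-function policy iteration on a tabular gym env (e.g. FrozenLake8x8-v1)."""
    n_s, n_a = env.observation_space.n, env.action_space.n
    P = env.unwrapped.P                      # P[s][a] -> list of (prob, next_state, reward, done)
    policy = np.zeros(n_s, dtype=np.int64)   # start from an arbitrary (all-LEFT) policy

    while True:
        # 1) policy evaluation: inner loop, iterate v until convergence under the current policy
        v = np.zeros(n_s)
        while True:
            old_v = v.copy()
            for s in range(n_s):
                v[s] = sum(p * (r + gamma * old_v[s2]) for p, s2, r, _ in P[s][policy[s]])
            if np.sum(np.abs(v - old_v)) < eps:
                break

        # 2) policy improvement: act greedily with respect to the converged v
        new_policy = np.array([
            np.argmax([sum(p * (r + gamma * v[s2]) for p, s2, r, _ in P[s][a])
                       for a in range(n_a)])
            for s in range(n_s)
        ])

        # 3) stop once the policy no longer changes, otherwise repeat
        if np.array_equal(new_policy, policy):
            return new_policy, v
        policy = new_policy

Calling policy_iteration_sketch(gym.make('FrozenLake8x8-v1')) should produce the same kind of optimal policy and value table as the trainer implemented below.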

Next we create a policy iteration subclass ***PolicyItertaionTrainer*** (keeping the spelling used in the assignment code) that inherits from the abstract parent class above.

class PolicyItertaionTrainer(TabularRLTrainerAbstract):
    def __init__(self, gamma=1.0, eps=1e-10, env_name='FrozenLake8x8-v1'):
        ...

    def train(self):
        ...

    def update_value_function(self):
        ...

    def update_policy(self):
        ...

This subclass definition contains four **[TODO]** blocks to fill in; let's go through them one by one:

  1. The constructor asks us to generate a random initial policy. Almost anything works here; to help the agent finish episodes a bit more often we can bias it toward moving down and to the right instead of calling random.choice() for every state, which slightly reduces the number of iterations needed. This initial policy is only a starting point and will be replaced by better ones as the iteration proceeds.

    def __init__(self, gamma=1.0, eps=1e-10, env_name='FrozenLake8x8-v1'):
        super(PolicyItertaionTrainer, self).__init__(env_name)

        # discount factor
        self.gamma = gamma

        # value function convergence criterion
        self.eps = eps

        # build the value table for each possible observation
        self.table = np.zeros((self.obs_dim,))

        # [TODO] you need to implement a random policy at the beginning.
        # It is a function that takes an integer (state, or say observation)
        # as input and returns an integer (action).
        # remember, you can use self.action_dim to get the dimension (range)
        # of the action, which is an integer in range
        # [0, ..., self.action_dim - 1]
        # hint: generating a random action at each call of the policy may lead to
        #  failure of convergence; try generating random actions at initialization
        #  and fixing them during training.
        # DOWN and RIGHT are the FrozenLake action indices (1 and 2), assumed to be
        # defined earlier in the notebook. Go down unless we are in the bottom row
        # (and not in the rightmost column), in which case go right.
        self.policy = lambda obs: DOWN if (obs + 1) % 8 == 0 or (obs + 8) < 64 else RIGHT
        # test your random policy
        test_random_policy(self.policy, self.env)
    
  2. Next, in ***train()***, we are asked whether the value function should be reset at the start of each outer iteration. In my tests, resetting or not makes little difference to the final state values and mean reward in this small assignment; it only changes how many inner-loop sweeps each evaluation takes. But since, as I understand policy iteration, each outer iteration should evaluate the value function of the new policy from scratch, I reset the table to zeros here.

    def train(self):
        """Conduct one iteration of learning."""
        # [TODO] the value function may need to be reset to zeros.
        # if you think it should, then do it. If not, then move on.
        # hint: the value function is equivalent to self.table,
        #  a numpy array with length 64.
        self.table = np.zeros((self.obs_dim,))
        self.update_value_function()
        self.update_policy()
    

    Continuing, train() calls ***update_value_function()*** and then update_policy() in turn; we look at these next.

  3. Let's go through update_value_function() statement by statement.

    def update_value_function(self):
        count = 0  # count the steps of value updates
        while True:
            old_table = self.table.copy()
            # keep a copy of the old value table

            for state in range(self.obs_dim):
                act = self.policy(state)  # action chosen by the current policy in this state
                transition_list = self._get_transitions(state, act)  # all possible transitions for (state, act)

                state_value = 0
                for transition in transition_list:
                    prob = transition['prob']  # probability of this transition
                    reward = transition['reward']  # immediate reward of this transition
                    next_state = transition['next_state']  # the next state it leads to
                    done = transition['done']  # whether the episode terminates there

                    # [TODO] what is the right state value?
                    # hint: you should use reward, self.gamma, old_table, prob,
                    # and next_state to compute the state value
                    state_value += prob * (reward + self.gamma * old_table[next_state])  # key step: Bellman expectation backup

                # update the state value
                self.table[state] = state_value

            # [TODO] Compare the old_table and current table to
            #  decide whether to break the value update process.
            # hint: you should use self.eps, old_table and self.table
            should_break = np.sum(np.abs(old_table - self.table)) < self.eps  # check convergence
            if should_break:
                break
    

    The key step inside the loop over all possible transitions, `state_value += prob * (reward + self.gamma * old_table[next_state])`, corresponds to the formula $v_{k+1}(s) = \mathbb{E}_{s'}\left[R(s,a) + \gamma\, v_k(s')\right]$.

    After each sweep we check whether the values have converged to decide whether to stop the loop.

  4. Finally, update_policy(), statement by statement:
    def update_policy(self):
        """You need to define a new policy function, given current
        value function. The best action for a given state is the one that
        has greatest expected return.
    
        To optimize computing efficiency, we introduce a policy table,
        which takes the state as index and returns the action for that state.
        """
        policy_table = np.zeros([self.obs_dim, ], dtype=np.int64)
    
        for state in range(self.obs_dim):
            state_action_values = [0] * self.action_dim
    
            # [TODO] assign the action with greatest "value"
            # to policy_table[state]
            # hint: what is the proper "value" here?
            #  you should use table, gamma, reward, prob,
            #  next_state and self._get_transitions() function
            #  as what we done at self.update_value_function()
            #  Bellman equation may help.
            # loop over every action the agent could take in this state
            for action in range(self.action_dim):
                transition_list = self._get_transitions(state, action)
                for transition in transition_list:
                    prob = transition['prob']
                    reward = transition['reward']
                    next_state = transition['next_state']
                    done = transition['done']
                    # expected return of taking this action in this state (Bellman backup)
                    state_action_values[action] += prob * (reward + self.gamma * self.table[next_state])
            # pick the action whose expected return is the greatest
            best_action = np.argmax(state_action_values)

            policy_table[state] = best_action  # record the best action for this state
    
        self.policy = lambda obs: policy_table[obs]
    

Next, let's look at the driver function:

# Managing configurations of your experiments is important for your research.
default_pi_config = dict(
    max_iteration=1000,
    evaluate_interval=1,
    gamma=1.0,
    eps=1e-10
)


def policy_iteration(train_config=None):
    config = default_pi_config.copy()
    if train_config is not None:
        config.update(train_config)
        
    trainer = PolicyItertaionTrainer(gamma=config['gamma'], eps=config['eps'])

    old_policy_result = {
        obs: -1 for obs in range(trainer.obs_dim)
    }

    for i in range(config['max_iteration']):
        # train the agent
        trainer.train()  # [TODO] please uncomment this line

        # [TODO] compare the new policy with the old policy to check whether
        #  we should stop. If the new and old policies give the same output for every
        #  observation, then we consider the algorithm converged and
        #  stop training.
        new_policy_result = {
            state: trainer.policy(state) for state in range(trainer.obs_dim)
        }
        should_stop = (new_policy_result == old_policy_result)

        if should_stop:
            print("We found policy is not changed anymore at "
                  "itertaion {}. Current mean episode reward "
                  "is {}. Stop training.".format(i, trainer.evaluate()))
            break
        old_policy_result = new_policy_result

    return trainer

pi_agent = policy_iteration()
pi_agent.print_table()

________________________________________
[INFO]	In 0 iteration, current mean episode reward is 0.822.
[DEBUG]	Updated values for 200 steps. Difference between new and old table is: 0.041664161897299004
[DEBUG]	Updated values for 400 steps. Difference between new and old table is: 0.0022292041480653085
[DEBUG]	Updated values for 600 steps. Difference between new and old table is: 0.0001184338151329345
[DEBUG]	Updated values for 800 steps. Difference between new and old table is: 6.291939822350434e-06
[DEBUG]	Updated values for 1000 steps. Difference between new and old table is: 3.3426684917237104e-07
[DEBUG]	Updated values for 1200 steps. Difference between new and old table is: 1.7758327711114852e-08
[DEBUG]	Updated values for 1400 steps. Difference between new and old table is: 9.434331232904825e-10
[INFO]	In 1 iteration, current mean episode reward is 0.804.
[DEBUG]	Updated values for 200 steps. Difference between new and old table is: 0.0005348034918033623
[DEBUG]	Updated values for 400 steps. Difference between new and old table is: 4.20104129661425e-06
[DEBUG]	Updated values for 600 steps. Difference between new and old table is: 2.8070459692774996e-08
[DEBUG]	Updated values for 800 steps. Difference between new and old table is: 1.7462087331665543e-10
[INFO]	In 2 iteration, current mean episode reward is 0.77.
[DEBUG]	Updated values for 200 steps. Difference between new and old table is: 0.0004257477615745714
[DEBUG]	Updated values for 400 steps. Difference between new and old table is: 1.4125290733302265e-05
[DEBUG]	Updated values for 600 steps. Difference between new and old table is: 3.971402302571647e-07
[DEBUG]	Updated values for 800 steps. Difference between new and old table is: 1.0301546324309463e-08
[DEBUG]	Updated values for 1000 steps. Difference between new and old table is: 2.548876110175513e-10
[INFO]	In 3 iteration, current mean episode reward is 0.688.
[DEBUG]	Updated values for 200 steps. Difference between new and old table is: 0.00019132880435357436
[DEBUG]	Updated values for 400 steps. Difference between new and old table is: 1.8012092146968417e-05
[DEBUG]	Updated values for 600 steps. Difference between new and old table is: 1.3738229058951612e-06
[DEBUG]	Updated values for 800 steps. Difference between new and old table is: 9.513477411404736e-08
[DEBUG]	Updated values for 1000 steps. Difference between new and old table is: 6.230829824316331e-09
[DEBUG]	Updated values for 1200 steps. Difference between new and old table is: 3.9353281744425317e-10
[INFO]	In 4 iteration, current mean episode reward is 0.829.
[INFO]	In 5 iteration, current mean episode reward is 0.867.
We found policy is not changed anymore at itertaion 6. Current mean episode reward is 0.867. Stop training.

______________________________________________________________
+-----+-----+-----State Value Mapping-----+-----+-----+
|     |   0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |
|-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0   |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 1   |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 2   |1.000|0.978|0.926|0.000|0.857|0.946|0.982|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 3   |1.000|0.935|0.801|0.475|0.624|0.000|0.945|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 4   |1.000|0.826|0.542|0.000|0.539|0.611|0.852|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 5   |1.000|0.000|0.000|0.168|0.383|0.442|0.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 6   |1.000|0.000|0.195|0.121|0.000|0.332|0.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 7   |1.000|0.732|0.463|0.000|0.277|0.555|0.777|0.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+

As shown above, the outer loop repeatedly calls the trainer's train() method, and after each policy update it decides whether to stop by checking whether the policy is unchanged (or the maximum number of iterations has been reached).

Once the policy has converged to the optimum, the mean episode reward is 0.867, with the corresponding value function shown in the table above.

2.2 Value Iteration

The difference between value iteration and policy iteration is this: policy iteration fully evaluates the value function under the current policy until it converges before updating the policy, whereas value iteration performs only a single value-update sweep per loop and then moves straight on, repeating until the values converge. The former stops when the policy no longer changes; the latter stops when the values converge. A condensed sketch is given below, followed by the parts of the trainer that differ from Section 2.1.
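
For comparison, here is a condensed standalone sketch of value iteration (again my own illustration, not the assignment's code), using the same env.unwrapped.P table and a function name of my choosing:

import gym
import numpy as np


def value_iteration_sketch(env, gamma=1.0, eps=1e-10):
    """Plain-function value iteration on a tabular gym env (e.g. FrozenLake8x8-v1)."""
    n_s, n_a = env.observation_space.n, env.action_space.n
    P = env.unwrapped.P                      # P[s][a] -> list of (prob, next_state, reward, done)
    v = np.zeros(n_s)

    while True:
        old_v = v.copy()
        for s in range(n_s):
            # Bellman optimality backup: v(s) <- max_a E_{s'}[R(s,a) + gamma * v(s')]
            v[s] = max(sum(p * (r + gamma * old_v[s2]) for p, s2, r, _ in P[s][a])
                       for a in range(n_a))
        if np.sum(np.abs(v - old_v)) < eps:  # stop once the value table has converged
            break

    # the greedy policy is only recovered at the end, from the converged value table
    policy = np.array([
        np.argmax([sum(p * (r + gamma * v[s2]) for p, s2, r, _ in P[s][a])
                   for a in range(n_a)])
        for s in range(n_s)
    ])
    return policy, v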

def train(self):
    """Conduct one iteration of learning."""
    # [TODO] the value function may need to be reset to zeros.
    # if you think it should, then do it. If not, then move on.
    # self.table = np.zeros((self.obs_dim,))

    # In value iteration, we do not explicitly require a
    # policy instance to run. We update the value function
    # directly based on the transitions. Therefore, we
    # don't need to run self.update_policy() in each step.
    self.update_value_function()

def update_value_function(self):
    old_table = self.table.copy()

    for state in range(self.obs_dim):
        # [TODO] what should be the right state value?
        # hint: try to compute the state-action values first
        action_state_value = np.zeros(self.action_dim)
        for action in range(self.action_dim):
            transition_list = self._get_transitions(state, action)
            for transition in transition_list:
                prob = transition['prob']
                reward = transition['reward']
                next_state = transition['next_state']
                done = transition['done']
                action_state_value[action] += prob * (reward + self.gamma * old_table[next_state])

        self.table[state] = max(action_state_value)
        
        
def evaluate(self):
    """Since in value itertaion we do not maintain a policy function,
        so we need to retrieve it when we need it."""
    self.update_policy()
    return super().evaluate()

So these are the only parts that differ. Note that in this train() method the table should not be reset to zeros: the goal is to let the value function converge, and each call performs only a single sweep, so resetting it would throw away the progress of previous sweeps and the values would never converge (the training loop would keep running).

for i in range(config['max_iteration']):
    old_state_value_table = trainer.table.copy()
    # train the agent
    trainer.train()  # [TODO] please uncomment this line
    # evaluate the result
    if i % config['evaluate_interval'] == 0:
        print("[INFO]\tIn {} iteration, current "
              "mean episode reward is {}.".format(
            i, trainer.evaluate()
        ))

        # [TODO] compare the new policy with the old policy to check whether
        #  we should stop.
        # [HINT] If the new and old policies give the same output for every
        #  observation, then we consider the algorithm converged and
        #  stop training.
        # default_vi_config is defined elsewhere in the notebook (not shown here),
        # analogous to default_pi_config above; here we stop once the value table converges.
        should_stop = (np.sum(np.abs(old_state_value_table - trainer.table)) < default_vi_config['eps'])
        
        if should_stop:
            print("We found policy is not changed anymore at "
                  "itertaion {}. Current mean episode reward "
                  "is {}. Stop training.".format(i, trainer.evaluate()))
            break

vi_agent = value_iteration()
vi_agent.render()
vi_agent.print_table()

_______________________________________
[INFO]	In 0 iteration, current mean episode reward is 0.0.
[INFO]	In 100 iteration, current mean episode reward is 0.892.
[INFO]	In 200 iteration, current mean episode reward is 0.867.
[INFO]	In 300 iteration, current mean episode reward is 0.867.
[INFO]	In 400 iteration, current mean episode reward is 0.867.
[INFO]	In 500 iteration, current mean episode reward is 0.867.
[INFO]	In 600 iteration, current mean episode reward is 0.867.
[INFO]	In 700 iteration, current mean episode reward is 0.867.
[INFO]	In 800 iteration, current mean episode reward is 0.867.
[INFO]	In 900 iteration, current mean episode reward is 0.867.
[INFO]	In 1000 iteration, current mean episode reward is 0.867.
[INFO]	In 1100 iteration, current mean episode reward is 0.867.
[INFO]	In 1200 iteration, current mean episode reward is 0.867.
[INFO]	In 1300 iteration, current mean episode reward is 0.867.
[INFO]	In 1400 iteration, current mean episode reward is 0.867.
[INFO]	In 1500 iteration, current mean episode reward is 0.867.
[INFO]	In 1600 iteration, current mean episode reward is 0.867.
We found policy is not changed anymore at itertaion 1600. Current mean episode reward is 0.867. Stop training.
_____________________________________
+-----+-----+-----State Value Mapping-----+-----+-----+
|     |   0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |
|-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0   |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 1   |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 2   |1.000|0.978|0.926|0.000|0.857|0.946|0.982|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 3   |1.000|0.935|0.801|0.475|0.624|0.000|0.945|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 4   |1.000|0.826|0.542|0.000|0.539|0.611|0.852|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 5   |1.000|0.000|0.000|0.168|0.383|0.442|0.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 6   |1.000|0.000|0.195|0.121|0.000|0.332|0.000|1.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 7   |1.000|0.732|0.463|0.000|0.277|0.555|0.777|0.000|
|     |     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+

The driver is called in the same way; note that trainer.evaluate() here first calls update_policy() to recover the greedy policy from the converged value table and then reuses the parent class's evaluate().

Playing the FrozenLake game with the trained agent:

(rendered episode screenshot omitted)

Source: https://blog.csdn.net/Liao164462791/article/details/122559360
