Learning RL Environments: gym[atari] and the Settings Commonly Used in Papers

0. The gym core

This part of the code lives in gym/core.py.

The base class is Env; its main methods are step, reset, render, close, and seed. The overall skeleton looks like this:

class Env(object):
    def reset(self):
        pass
    def step(self, action):
        pass
    def render(self, mode='human'):
        pass
    def close(self):
        pass
    def seed(self, seed=None):
        pass

The Wrapper class, which also inherits from Env, wraps an existing environment:

class Wrapper(Env):
    def __init__(self, env):
        self.env = env

    def step(self, action):
        return self.env.step(action)

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def render(self, mode='human', **kwargs):
        return self.env.render(mode, **kwargs)

    def close(self):
        return self.env.close()

    def seed(self, seed=None):
        return self.env.seed(seed)

The point of the wrapper is that when we want to customize an environment we can subclass Wrapper directly and override some of its methods; at use time we pass the chosen game env in as an argument, and the configuration of that environment changes accordingly.

There are also corresponding wrappers for observations, rewards, and actions; overriding the matching method is all that is needed. For example, ObservationWrapper:

class ObservationWrapper(Wrapper):
    def reset(self, **kwargs):
        observation = self.env.reset(**kwargs)
        return self.observation(observation)

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        return self.observation(observation), reward, done, info

    def observation(self, observation):
        raise NotImplementedError
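
As an illustrative sketch (the class name below is made up, not part of gym or baselines), a custom observation wrapper only needs to implement observation():

import gym
import numpy as np

class FlipObservation(gym.ObservationWrapper):
    """Hypothetical example: mirror every frame left-to-right.
    The frame shape is unchanged, so observation_space needs no update."""
    def observation(self, observation):
        return np.flip(observation, axis=1)

# Usage sketch: env = FlipObservation(gym.make('PongNoFrameskip-v4'))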

1. Environment names

Each Atari game environment uses suffixes to distinguish subtle internal differences.

Taking Pong as an example: Pong-ram-v0 means the observation is the Atari machine's RAM (a 128-dimensional byte vector), while the other variants use a 210x160 screen image as the observation. The finer distinctions are as follows (from endtoend.ai/envs/gym/at):

A v0 suffix means that with some probability p the previous action is repeated regardless of what the agent chooses (sticky actions, which add stochasticity to the environment; see Revisiting the Arcade Learning Environment); a v4 suffix means p = 0. The middle field controls how often the agent acts: one action is taken every k frames and held for those k frames, which also keeps trained agents from reacting faster than a human could. For example, Pong-v4 samples k from {2, 3, 4}, PongDeterministic-v4 fixes k = 4, and PongNoFrameskip-v4 uses k = 1 (no frame skipping).

The following code lists all registered environments:

from gym import envs
env_names = [spec.id for spec in envs.registry.all()] 
for name in sorted(env_names): 
    print(name)
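
For example, the difference between the RAM and image variants shows up directly in the observation space (illustrative snippet; the exact Box repr depends on the gym version):

import gym

print(gym.make('Pong-ram-v4').observation_space)         # Box over 128 RAM bytes
print(gym.make('PongNoFrameskip-v4').observation_space)  # Box over the 210x160x3 screen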

2. Additional configuration

Besides the environment's built-in settings, experiments usually apply a series of extra configurations before training, typically by subclassing gym.Wrapper and overriding some of its methods.
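
The wrapper snippets below are adapted from OpenAI Baselines' atari_wrappers and assume roughly the following imports (TimeLimit is only needed for make_atari in section 3):

from collections import deque

import cv2
import gym
import numpy as np
from gym import spaces
from gym.wrappers import TimeLimit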

2.1 Reset rules

The Atari game environment is deterministic. An agent may perform well in a deterministic environment while remaining highly sensitive to small perturbations, so some randomness is usually added in the setup.

class NoopResetEnv(gym.Wrapper):
    def __init__(self, env, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        gym.Wrapper.__init__(self, env)
        self.noop_max = noop_max
        self.override_num_noops = None
        self.noop_action = 0
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def reset(self, **kwargs):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset(**kwargs)
        if self.override_num_noops is not None:
            noops = self.override_num_noops
        else:
            noops = self.unwrapped.np_random.randint(1, self.noop_max + 1) #pylint: disable=E1101
        assert noops > 0
        obs = None
        for _ in range(noops):
            obs, _, done, _ = self.env.step(self.noop_action)
            if done:
                obs = self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)

After reset, the environment executes a random number of no-op steps (in [1, noop_max]), so the agent starts from slightly different initial states.

2.2 Fire on reset

class FireResetEnv(gym.Wrapper):
    def __init__(self, env):
        """Take action on reset for environments that are fixed until firing."""
        gym.Wrapper.__init__(self, env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def reset(self, **kwargs):
        self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset(**kwargs)
        return obs

    def step(self, ac):
        return self.env.step(ac)

Some agents have trouble learning the FIRE action needed to start the game, so this wrapper presses FIRE (action 1) right after reset; github.com/openai/basel has further discussion.

2.3 Episode termination

In some games the player has several lives. Treating game over as the end of a training episode can keep the agent from learning how costly losing a life is (Mnih et al., 2015).

class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        gym.Wrapper.__init__(self, env)
        self.lives = 0
        self.was_real_done  = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()

        if lives < self.lives and lives > 0:
            # for Qbert sometimes we stay in lives == 0 condition for a few frames
            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self, **kwargs):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
        self.lives = self.env.unwrapped.ale.lives()
        return obs

In this wrapper, was_real_done flags whether the game has truly ended, while every lost life marks done for the training episode.

Although this trick may teach the agent to avoid dying, Bellemare et al. (2016b) note that it can hurt the agent's final performance, and there is also a case for minimizing the use of game-specific information.

2.4 Frame skipping

Atari games run at 60 frames per second by default. If we want to control the frame skipping ourselves, we pick the NoFrameskip version and then configure the environment.

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env, skip=4):
        """Return only every `skip`-th frame"""
        gym.Wrapper.__init__(self, env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
        self._skip       = skip

    def step(self, action):
        """Repeat action, sum reward, and max over last observations."""
        total_reward = 0.0
        done = None
        for i in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            if i == self._skip - 2: self._obs_buffer[0] = obs
            if i == self._skip - 1: self._obs_buffer[1] = obs
            total_reward += reward
            if done:
                break
        # Note that the observation on the done=True frame
        # doesn't matter
        max_frame = self._obs_buffer.max(axis=0)

        return max_frame, total_reward, done, info

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

The code above implements the frame-skipping environment (Naddaf, 2010), using skip=4 as an example: the same action is executed on every skipped frame, the rewards over those frames are summed into the step reward, and the returned observation is a pixel-wise max over the last two frames, which removes the flickering of sprites that Atari renders only on alternate frames (Montfort & Bogost, 2009).

2.5 Reward clipping and observation preprocessing

class ClipRewardEnv(gym.RewardWrapper):
    def __init__(self, env):
        gym.RewardWrapper.__init__(self, env)

    def reward(self, reward):
        """Bin reward to {+1, 0, -1} by its sign."""
        return np.sign(reward)

Reward clipping bins the reward into {+1, 0, -1} by its sign (Mnih et al., 2015), so that large differences in reward scale across games do not distort the algorithm. A similar approach divides every reward by the first non-zero reward obtained (Bellemare et al., 2013), under the assumption that this first reward is representative of the game's reward scale.
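
A minimal sketch of that normalization, assuming we simply remember the magnitude of the first non-zero reward (the class name and details are illustrative, not from baselines):

class NormalizeByFirstRewardEnv(gym.RewardWrapper):
    def __init__(self, env):
        gym.RewardWrapper.__init__(self, env)
        self.first_reward = None  # magnitude of the first non-zero reward seen

    def reward(self, reward):
        if self.first_reward is None and reward != 0:
            self.first_reward = abs(reward)
        # pass rewards through unchanged until a non-zero reward has been seen
        return reward if self.first_reward is None else reward / self.first_reward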

class WarpFrame(gym.ObservationWrapper):
    def __init__(self, env, width=84, height=84, grayscale=True):
        """Warp frames to 84x84 as done in the Nature paper and later work."""
        gym.ObservationWrapper.__init__(self, env)
        self.width = width
        self.height = height
        self.grayscale = grayscale
        if self.grayscale:
            self.observation_space = spaces.Box(low=0, high=255,
                shape=(self.height, self.width, 1), dtype=np.uint8)
        else:
            self.observation_space = spaces.Box(low=0, high=255,
                shape=(self.height, self.width, 3), dtype=np.uint8)

    def observation(self, frame):
        if self.grayscale:
            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
        if self.grayscale:
            frame = np.expand_dims(frame, -1)
        return frame

This resizes the raw 210x160 frame to 84x84 and converts the color image to grayscale.

class ScaledFloatFrame(gym.ObservationWrapper):
    def __init__(self, env):
        gym.ObservationWrapper.__init__(self, env)
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)

    def observation(self, observation):
        # careful! This undoes the memory optimization, use
        # with smaller replay buffers only.
        return np.array(observation).astype(np.float32) / 255.0

This rescales the raw observation from [0, 255] to [0, 1].

A single 84x84 float32 frame takes 84x84x4 = 28,224 bytes. A replay buffer is commonly sized at 10^6 transitions; counting only the current and next observations, that is 2 x 28,224 x 10^6 bytes ≈ 56,448 MB, so some memory optimization is needed.

One option is to skip the ScaledFloatFrame conversion and store the raw [0, 255] uint8 frames in the replay buffer, cutting memory use to a quarter, and only rescale to [0, 1] when feeding observations into the neural network.
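
A minimal sketch of this deferred scaling with a plain numpy buffer (the buffer layout and names are illustrative only):

import numpy as np

capacity = 100_000  # smaller than 10^6, just for the example
obs_buffer = np.zeros((capacity, 84, 84, 1), dtype=np.uint8)  # raw uint8 frames

def sample_as_network_input(indices):
    """Cast to float32 and rescale to [0, 1] only when a batch is sampled."""
    batch = obs_buffer[indices]               # still uint8 in memory
    return batch.astype(np.float32) / 255.0   # scaled copy for the network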

2.6 Frame Stacking

A single frame may not give the agent enough information as an observation (e.g., one frame says nothing about velocities); frame stacking combines the last k frames into one observation to keep the problem from becoming partially observable.

class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
        """Stack k last frames.
        Returns lazy array, which is much more memory efficient.
        See Also
        --------
        baselines.common.atari_wrappers.LazyFrames
        """
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[:-1] + (shp[-1] * k,)), dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        assert len(self.frames) == self.k
        return LazyFrames(list(self.frames))


class LazyFrames(object):
    def __init__(self, frames):
        """This object ensures that common frames between the observations are only stored once.
        It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
        buffers.
        This object should only be converted to numpy array before being passed to the model.
        You'd not believe how complex the previous solution was."""
        self._frames = frames
        self._out = None

    def _force(self):
        if self._out is None:
            self._out = np.concatenate(self._frames, axis=-1)
            self._frames = None
        return self._out

    def __array__(self, dtype=None):
        out = self._force()
        if dtype is not None:
            out = out.astype(dtype)
        return out

    def __len__(self):
        return len(self._force())

    def __getitem__(self, i):
        return self._force()[..., i]

LazyFrames exists to keep the memory footprint of the 1M-frame replay buffer down: consecutive stacked observations share most of their frames, so each raw frame is stored only once and the concatenation is deferred until the observation is actually converted to an array.
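
In practice the stacked observation stays lazy until it is converted; an illustrative usage with the classes defined above:

import numpy as np

env = FrameStack(WarpFrame(gym.make('PongNoFrameskip-v4')), k=4)
lazy_obs = env.reset()              # a LazyFrames object, frames not yet concatenated
model_input = np.array(lazy_obs)    # materializes an (84, 84, 4) uint8 array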

3. Example

Putting the configurations above together, a typical preprocessing setup for RL on Atari looks like this:

def make_atari(env_id, max_episode_steps=None):
    env = gym.make(env_id)
    assert 'NoFrameskip' in env.spec.id
    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
    if max_episode_steps is not None:
        env = TimeLimit(env, max_episode_steps=max_episode_steps)
    return env

def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False):
    """Configure environment for DeepMind-style Atari.
    """
    if episode_life:
        env = EpisodicLifeEnv(env)
    if 'FIRE' in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)
    env = WarpFrame(env)
    if scale:
        env = ScaledFloatFrame(env)
    if clip_rewards:
        env = ClipRewardEnv(env)
    if frame_stack:
        env = FrameStack(env, 4)
    return env
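
An illustrative way to use these two helpers for a DQN-style agent:

env = make_atari('PongNoFrameskip-v4')
env = wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=True, scale=False)

obs = env.reset()
print(np.array(obs).shape)  # (84, 84, 4): warped, grayscale, stacked frames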

References

Atari Environments. endtoend.ai/envs/gym/at
Machado, Marlos C., et al. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents.
OpenAI Baselines (atari_wrappers.py).