[Autonomous Driving] Playing a Racing Game with Reinforcement Learning

TORCS (The Open Racing Car Simulator) is a free, open-source racing simulator that runs on a PC. It provides a range of cars, tracks, and physics models, so users can simulate races and run experiments. TORCS also supports AI drivers: users can write programs, in a variety of languages, to drive the cars. This article uses Python on Windows, with the DDPG reinforcement learning algorithm as the AI policy, to complete the racing task.

1 Python Environment Setup and Configuration

Create a Python environment. Here a new environment is created with Anaconda; if you already have a suitable Python environment, you can skip this step.

conda create -n rl_torcs python=3.6

image-20230410161228354

Install the required libraries:

pip install gym
pip install matplotlib
pip install torch
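
As a quick sanity check, you can confirm in Python that the libraries import correctly and whether a GPU is visible (the training script later in this article will use CUDA automatically when it is available):

import gym
import torch
import matplotlib

# Print versions and check GPU availability
print("gym", gym.__version__)
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("matplotlib", matplotlib.__version__)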

2 TORCS Installation and Configuration

TORCS website: https://torcs.sourceforge.net/

image-20230410162106951

After downloading the installer, simply run it. Once the installation succeeds, the following icon appears on the desktop:

image-20230410162142204

At this point you can race in the game, but it cannot yet be controlled by a program; you also need to install the championship platform interface:

Championship Platform download: https://sourceforge.net/projects/cig/files/SCR Championship/

After downloading, extract the archive:

image-20230429220348746

Copy the extracted files into the TORCS installation directory, overwriting the existing files:

image-20230429220725642

Open wtorcs.exe:

image-20230429221013941

Select Practice mode:

image-20230429222847094

Enter the track configuration:

image-20230429222923932

Here you can choose the shape of the track to suit your needs; we keep the defaults and click Accept:

image-20230429221321503

Select the control interface:

image-20230429221925259

Once configuration is complete, enter the game; it now waits for data on its connection port.

image-20230429222112110

As the console shows, the port number is 3001, so the control program must also connect through this port.

image-20230429222206417
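
Before moving on, you can optionally check from Python that the server inside wtorcs.exe is actually listening on this port. The sketch below imitates the UDP init handshake performed by snakeoil3_gym.py; the exact message format, the 19 rangefinder angles, and the "***identified***" reply are assumptions taken from that client.

import socket

# Minimal UDP handshake with the SCR server; the init string mirrors snakeoil3_gym.py (assumption)
angles = " ".join(str(a) for a in range(-90, 91, 10))   # 19 rangefinder angles (assumption)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)
sock.sendto("SCR(init {})".format(angles).encode(), ("localhost", 3001))
try:
    data, _ = sock.recvfrom(1024)
    print("Server replied:", data.decode())   # expect something like "***identified***"
except socket.timeout:
    print("No reply - check that the game is waiting on port 3001")
finally:
    sock.close()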

3 The Reinforcement Learning Algorithm

Deep Deterministic Policy Gradient (DDPG), proposed by DeepMind in 2016, combines deep learning with the deterministic policy gradient method.

Figure 1

The Critic network estimates the Q value: its inputs are the state and the action, and its loss is the TD error. The Actor network outputs an action a that, when fed into the Critic, yields the largest possible Q value. Experience has shown that learning with only a single Q network is very unstable; to address this, DDPG keeps two copies of both the policy network and the Q network, one called online and one called target. The update procedure is given in Algorithm 1:

image-20230429223753433
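
To make the Actor-Critic structure above concrete, here is a minimal PyTorch sketch of what the two networks might look like. Layer sizes and activations are assumptions; the actual actorNetwork.py / criticNetwork.py used by the training code come from the reference repo and may differ.

import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """Maps a state to the action vector (steering, acceleration, brake)."""
    def __init__(self, state_dim, hidden1=300, hidden2=600):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden1), nn.ReLU(),
                                  nn.Linear(hidden1, hidden2), nn.ReLU())
        self.steering = nn.Sequential(nn.Linear(hidden2, 1), nn.Tanh())     # in [-1, 1]
        self.accel = nn.Sequential(nn.Linear(hidden2, 1), nn.Sigmoid())     # in [0, 1]
        self.brake = nn.Sequential(nn.Linear(hidden2, 1), nn.Sigmoid())     # in [0, 1]

    def forward(self, state):
        x = self.body(state)
        return torch.cat([self.steering(x), self.accel(x), self.brake(x)], dim=1)

class CriticNetwork(nn.Module):
    """Estimates the scalar Q value for a (state, action) pair."""
    def __init__(self, state_dim, action_dim, hidden1=300, hidden2=600):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden1), nn.ReLU(),
                                 nn.Linear(hidden1, hidden2), nn.ReLU(),
                                 nn.Linear(hidden2, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))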

DDPG is a widely used off-policy method; learning repeatedly alternates between two steps, collecting data and updating parameters. It borrows two ideas from DQN that make the updates more stable. First, it introduces an experience replay mechanism: Markovian trajectory data are broken into individual transitions, which reduces the correlation between samples. Second, target values are produced by separate target networks, so their computation is not affected by the latest network parameters. Because DDPG combines the Actor-Critic structure with DQN-style target networks, it maintains four networks: the main policy network $\mu(s \mid \theta^{\mu})$, the main value network $Q(s, a \mid \theta^{Q})$, the target policy network $\mu'(s \mid \theta^{\mu'})$, and the target value network $Q'(s, a \mid \theta^{Q'})$. The main policy network interacts with the environment to collect trajectories, which are stored as transitions in the replay buffer; to balance exploration and exploitation, noise is added to the actions it outputs. The main value network approximates the value function and is updated by minimizing the mean squared error over the current mini-batch, with the loss

$$L(\theta^{Q}) = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^{2}$$

where $y_i$ is the target value computed by the target value network; to compute it, the target policy network supplies the action fed into the target value network:

$$y_i = r_i + \gamma\, Q'\!\left(s_{i+1},\, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$$

Because the target value is produced by the additionally introduced target networks, it is not affected by the most recent parameters, which reduces divergence and oscillation during learning. The main policy network is updated with the policy gradient; its objective is defined as the expected value over the data in the replay buffer:

$$J(\theta^{\mu}) = \mathbb{E}_{s \sim \rho^{\beta}}\left[Q\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right)\right]$$

where $\rho^{\beta}$ denotes the data distribution of the replay buffer. The expectation is approximated by the mini-batch sample mean, and the chain rule then gives the gradient of the objective with respect to the main policy network's parameters:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\Big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\Big|_{s=s_i}$$

Both the main policy network and the main value network update their parameters by gradient descent, though with different loss functions. The target networks can be updated in two ways: a hard update, which directly copies the main network after it has been updated a number of times, and a soft update, which tracks the main network with an exponentially weighted moving average:

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\,\theta^{\mu'}$$
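
The experience replay described above can be implemented as a small FIFO buffer. Below is a minimal sketch compatible with the buff.push(...) / buff.sample(...) calls used in the training code later in this article; the reference repo's replayBuffer.py may differ in its details.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, buffer_size):
        self.buffer = deque(maxlen=buffer_size)   # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        # Store one transition as a tuple, matching the e[0]..e[4] indexing in the training loop
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample at most batch_size transitions (fewer while the buffer is still filling up)
        return random.sample(self.buffer, min(len(self.buffer), batch_size))

    def __len__(self):
        return len(self.buffer)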

Reference code: https://github.com/Nomiizz

Modify snakeoil3_gym.py so that the client connects to the port reported by the game:

class Client():
    def __init__(self,H=None,p=None,i=None,e=None,t=None,s=None,d=None,vision=False):
        # If you don't like the option defaults,  change them here.
        self.vision = vision

        self.host= 'localhost'
        self.port= 3001  # note: this must match the port shown by the game
        self.sid= 'SCR'
        self.maxEpisodes=1 # "Maximum number of learning episodes to perform"
        self.trackname= 'unknown'
        self.stage= 3 # 0=Warm-up, 1=Qualifying 2=Race, 3=unknown <Default=3>
        self.debug= False
        self.maxSteps= 100000  # 50steps/second
        self.parse_the_command_line()
        if H: self.host= H
        if p: self.port= p
        if i: self.sid= i
        if e: self.maxEpisodes= e
        if t: self.trackname= t
        if s: self.stage= s
        if d: self.debug= d
        self.S= ServerState()
        self.R= DriverAction()
        self.setup_connection()
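
Before starting a long training run, a quick smoke test (a sketch, assuming gym_torcs.py from the reference repo is on the Python path) can confirm that the modified client actually reaches the game:

from gym_torcs import TorcsEnv

env = TorcsEnv(vision=False, throttle=True, gear_change=False)
ob = env.reset(relaunch=True)              # the game window should start a practice session
print(ob.speedX, ob.angle, ob.trackPos)    # a few of the sensor readings used as state
env.end()                                  # shuts TORCS down again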

Training code:

import torch
from torch.autograd import Variable
import numpy as np
import random

from gym_torcs import TorcsEnv
import argparse
import collections

from replayBuffer import ReplayBuffer
from actorNetwork import ActorNetwork
from criticNetwork import CriticNetwork
from OU import OU
import time

import matplotlib.pyplot as plt


def train(train_indicator = 1):

    # Parameter initializations
    BUFFER_SIZE = 100000
    BATCH_SIZE = 32
    GAMMA = 0.99
    TAU = 0.001     #Target Network HyperParameters
    LRA = 0.0001    #Learning rate for Actor
    LRC = 0.001     #Learning rate for Critic

    action_dim = 3  #Steering/Acceleration/Brake
    state_dim = 29  #of sensors input

    vision = False

    EXPLORE = 100000.
    episode_count = 1500
    max_steps = 10000
    done = False
    step = 0
    epsilon = 1

    timeout = time.time() + 60*540   # 9 hours from now

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Create the actor & critic models
    actor = ActorNetwork(state_dim).to(device)
    critic = CriticNetwork(state_dim, action_dim).to(device)

    target_actor = ActorNetwork(state_dim).to(device)
    target_critic = CriticNetwork(state_dim, action_dim).to(device)

    buff = ReplayBuffer(BUFFER_SIZE)    #Create replay buffer

     # Generate a Torcs environment
    env = TorcsEnv(vision=vision, throttle=True, gear_change=False)

    # Load the actual and target models
    print("Loading models")
    try:

        actor.load_state_dict(torch.load('actormodel_nt.pth'))
        actor.eval()
        critic.load_state_dict(torch.load('criticmodel_nt.pth'))
        critic.eval()
        print("Models loaded successfully")
    except:
        print("Cannot find the models")

    target_actor.load_state_dict(actor.state_dict())
    target_actor.eval()
    target_critic.load_state_dict(critic.state_dict())
    target_critic.eval()

    # Set the optimizer and loss criterion
    criterion_critic = torch.nn.MSELoss(reduction='sum')

    optimizer_actor = torch.optim.Adam(actor.parameters(), lr=LRA)
    optimizer_critic = torch.optim.Adam(critic.parameters(), lr=LRC)

    if torch.cuda.is_available():
        torch.set_default_tensor_type('torch.cuda.FloatTensor')
    else:
        torch.set_default_tensor_type('torch.FloatTensor')


    cum_rewards = []

    print("TORCS training begins!")
    
    for ep in range(episode_count):

        if np.mod(ep, 3) == 0:
            ob = env.reset(relaunch = True)   #relaunch TORCS every 3 episodes because of the memory leak error
        else:
            ob = env.reset()

        # State variables
        s_t = np.hstack((ob.angle, ob.track, ob.trackPos, ob.speedX, ob.speedY, ob.speedZ, ob.wheelSpinVel/100.0, ob.rpm))

        total_reward = 0.

        for i in range(max_steps):
            critic_loss = 0
            epsilon -= 1.0 / EXPLORE  # Decaying epsilon for noise addition
            a_t = np.zeros([1, action_dim])
            noise_t = np.zeros([1, action_dim])

            a_t_original = actor(torch.tensor(s_t.reshape(1, s_t.shape[0]), device=device).float())

            if torch.cuda.is_available():
                a_t_original = a_t_original.data.cpu().numpy()
            else:
                a_t_original = a_t_original.data.numpy()

            noise_t[0][0] = OU.OUnoise(a_t_original[0][0],  0.0 , 0.60, 0.30)
            noise_t[0][1] = OU.OUnoise(a_t_original[0][1],  0.5 , 1.00, 0.10)
            noise_t[0][2] =  OU.OUnoise(a_t_original[0][2], -0.1 , 1.00, 0.05)

            # Stochastic brake
            if random.random() <= 0.1:
                print("Applying the brake")
                noise_t[0][2] = OU.OUnoise(a_t_original[0][2], 0.2, 1.00, 0.10)

            alpha = train_indicator * max(epsilon, 0)
            
            a_t[0][0] = a_t_original[0][0] + alpha * noise_t[0][0]
            a_t[0][1] = a_t_original[0][1] + alpha * noise_t[0][1]
            a_t[0][2] = a_t_original[0][2] + alpha * noise_t[0][2]

            # Perform action and get env feedback
            ob, r_t, done, info = env.step(a_t[0])

            # New state variables
            s_t_new = np.hstack((ob.angle, ob.track, ob.trackPos, ob.speedX, ob.speedY, ob.speedZ, ob.wheelSpinVel/100.0, ob.rpm))

            # Add to replay buffer
            buff.push(s_t, a_t[0], r_t, s_t_new, done)


            # Do the batch update
            batch = buff.sample(BATCH_SIZE)

            states = torch.tensor(np.asarray([e[0] for e in batch]), device=device).float()
            actions = torch.tensor(np.asarray([e[1] for e in batch]), device=device).float()
            rewards = torch.tensor(np.asarray([e[2] for e in batch]), device=device).float()
            new_states = torch.tensor(np.asarray([e[3] for e in batch]), device=device).float()
            dones = np.asarray([e[4] for e in batch])
            
            y_t = torch.tensor(np.asarray([e[1] for e in batch]), device=device).float()  # placeholder tensor; every row is overwritten with the TD target below

            # use target network to calculate target_q_value (q_prime)
            target_q_values = target_critic(new_states, target_actor(new_states))


            for k in range(len(batch)):
                if dones[k] == False:
                    y_t[k] = rewards[k] + GAMMA * target_q_values[k]
                else:
                    y_t[k] = rewards[k]


            if (train_indicator):
                
                # Critic update
                q_values = critic(states, actions)
                critic_loss = criterion_critic(y_t, q_values)
                optimizer_critic.zero_grad()
                critic_loss.backward(retain_graph=True)
                optimizer_critic.step()

                # Actor update
                # policy_loss = -torch.mean(critic(states, actor(states)))

                # optimizer_actor.zero_grad()
                # policy_loss.backward(retain_graph=True)  -------> This is leading to memory leak :(
                # optimizer_actor.step()

                a_for_grad = actor(states)
                a_for_grad.requires_grad_()    #enables the requires_grad of a_for_grad
                q_values_for_grad = critic(states, a_for_grad)
                critic.zero_grad()
                q_sum = q_values_for_grad.sum()
                q_sum.backward(retain_graph=True)

                # Deterministic policy gradient: take dQ/da evaluated at a = actor(states)...
                grads = torch.autograd.grad(q_sum, a_for_grad)

                act = actor(states)
                actor.zero_grad()
                act.backward(-grads[0])   # ...and push -dQ/da back through the actor (gradient ascent on Q)
                optimizer_actor.step()


                # update target networks 
                for target_param, param in zip(target_actor.parameters(), actor.parameters()):
                    target_param.data.copy_(TAU * param.data + (1.0 - TAU) * target_param.data)
                       
                for target_param, param in zip(target_critic.parameters(), critic.parameters()):
                    target_param.data.copy_(TAU * param.data + (1.0 - TAU) * target_param.data)


            total_reward += r_t
            s_t = s_t_new # Update the current state

            if np.mod(i, 100) == 0: 
                print("Episode", ep, "Step", step, "Action", a_t, "Reward", r_t, "Loss", critic_loss)

            step += 1
            
            if done:
                break

        if np.mod(ep, 3) == 0:
            if (train_indicator):
                print("Saving models")
                torch.save(actor.state_dict(), 'actormodel_nt.pth')
                torch.save(critic.state_dict(), 'criticmodel_nt.pth')

        cum_rewards.append(total_reward)


        print("TOTAL REWARD @ " + str(ep) +"-th Episode  : Reward " + str(total_reward))
        print("Total Steps: " + str(step))
        print("")

        if time.time() > timeout:
            break

    env.end()  # This is for shutting down TORCS
    print("Finish.")

    np.savetxt('rewards_nt.csv', np.array(cum_rewards), delimiter=',')

    episodes = np.arange(ep + 1)
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.title("DDPG (No Stochastic Braking)")
    plt.plot(episodes, np.array(cum_rewards))

    plt.show()

if __name__ == "__main__":
    train()
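
The script above also imports a small OU helper (from OU import OU) for the exploration noise. A minimal sketch compatible with the OU.OUnoise(x, mu, theta, sigma) calls in the training loop is shown below; the reference repo's own OU.py may be implemented differently.

import numpy as np

class OU:
    @staticmethod
    def OUnoise(x, mu, theta, sigma):
        # One step of an Ornstein-Uhlenbeck process: pull x toward mu, plus Gaussian noise
        return theta * (mu - x) + sigma * np.random.randn()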

4 Results

image-20230429223753433