Getting Started

The developers of the Unity ML-Agents toolkit have done a fantastic job of making the toolkit as easy to use as possible, especially if you are familiar with the Unity platform and have basic programming skills (C# and Python in particular). The toolkit also includes 10 Example Environments and a tutorial on creating an ML-Agents Unity Environment for AI training and testing. Here we provide an additional tutorial on how to get started with the Unity ML-Agents toolkit by means of the single-agent game of Wall Pong. You can complete the tutorial yourself or download the Adventures in Unity ML-Agents (AiUMLA) Basic ML-Agents Examples package (requires Unity 2018.2.1f1 or above) from our Github repository, which includes two additional examples: Catch Ball and a simple version of the classic game Pong.

NOTE: Here we assume you have basic programming skills in C# and know how to execute a Python script from a Terminal window or Anaconda prompt. If you haven’t used Unity before then we recommend you complete the Introduction to Unity tutorials at https://unity3d.com/learn before continuing.


Wall Pong

Wall Pong is a single-agent adaptation of the classic game Pong. As illustrated below, the aim of the game is to move a paddle back and forth in order to bounce a ball against the walls (bounds) of a rectangular game area. The game involves four possible actions [do nothing, fire ball (i.e., start game), move paddle left, and move paddle right], with a player (agent) receiving 1 point every time the ball hits the paddle. The goal is to keep the ball in play for as long as possible (the more times the paddle hits the ball, the higher the player/agent’s score), with the game ending if the paddle misses the ball.

Installing the ML-Agents toolkit

Before starting this tutorial you should have Unity installed on your machine (at the time of writing this tutorial we were using Unity 2018.2.1f1 on a Windows 10 machine). We will assume you are using Anaconda with Python 3.6 and that you have downloaded and installed the Unity ML-Agents toolkit from Github (this tutorial has been upgraded to ML-Agents Beta v0.5). For instructions on how to install the Unity ML-Agents v0.5 toolkit using Anaconda, Windows users should go here [Windows ML-Agents Install], Mac/Unix users should go here [Mac/Unix ML-Agents Install].

NOTE: If you haven’t installed the Unity ML-Agents toolkit, we highly recommend you install the toolkit and the required dependencies within a virtual environment. For instance, create a conda environment called “ml-agents” and install everything needed to run Unity ML-Agents within that virtual environment. Go here for instructions on how to create a virtual (conda) environment.

NOTE: We recommend that you DO NOT install the ML-Agents toolkit with GPU support unless you are absolutely sure you are going to be training your agents using image/pixel data. In most cases, particularly when first using the toolkit, your agents will be trained using vector data (i.e., arrays of position and velocity data  specifying the location and movement direction of task/game relevant environmental objects and agents), which is typically processed faster using your CPU. In short, using your GPU(s) will result in slower training compared to using your CPU, unless the observation data needed/used for training is very, very large. If you do want to use your GPU(s) for training, then we recommend you install the toolkit twice, once for CPU based training and then again in a separate virtual environment for GPU based training (i.e., called “ml-agents-gpu”). This means you can easily switch between CPU and GPU based training depending on your needs.

Setting Up The Unity Development Environment

Open Unity and from the project selector click the +NEW project button. You can call your project whatever you like (e.g., “Wall Pong” or “My First ML-Agents Game”). After you enter your project name, make sure the 3D radio button is selected and click Create Project.

Import the ML-Agents Unity Components

The first thing to do after the main Unity editor opens is to import the necessary Unity ML-Agents package components into the project. This can be done in several ways. The easiest way is to open up a file browser/explorer window on your computer, locate the directory where you downloaded, saved and installed the Unity ML-Agents toolkit, navigate to the …\ml-agents\UnitySDK\Assets\ML-Agents sub-folder and drag/copy the ML-Agents folder (with all of its content and sub-folders) into the Assets folder in the Unity Project Window (don’t worry if you get some error messages in the Unity console window, those will be resolved shortly).

NOTE: you can delete the Examples sub-folder after everything has been imported, but feel free to leave it in if you plan on exploring the Unity ML-Agents examples that come with the toolkit.

Change the Default Project Settings

The next thing to do is to change some of the default project settings. Go to the main menu bar and select Edit -> Project Settings -> Player. The Player Settings panel should open up in the Unity Inspector Window (on the right of the main Unity window if you are using the default window layout). Make the following changes in the Resolution and Presentation panel and the Other Settings panel:

Resolution and Presentation:
  • Set Fullscreen Mode to “Windowed”. In older versions of Unity, turn Default Is Full Screen off (unchecked).
  • Turn Run in Background on (checked) and set the Display Resolution Dialog to “Disabled”.
Other Settings:
  • Find the Configuration section
  • For Scripting Runtime Version select Experimental (.NET 4.x Equivalent or .NET 4.6 Equivalent). Note that the Unity Editor may ask to reload; select Yes, and after the Editor reloads, navigate back to the Other Settings panel via Edit -> Project Settings -> Player.
  • In the input box under Scripting Define Symbols type in the flag ENABLE_TENSORFLOW. Make sure you hit enter on your keyboard after typing in the flag (again, don’t worry if you get some error messages popping up in the Unity console window).

After making the above changes, make sure you save the project: File -> Save Project

Install the TensorFlow C# Unity Plugin

The TensorFlow C# plugin can be downloaded here [TensorFlowSharp Plugin] and is necessary to run and test your agents after training. To install (import) the plugin, simply double click on the file after it has downloaded and you have uncompressed/unzipped it. Once the Unity import file window opens up, click the Import button. Alternatively, you can import it by going to the main menu bar, selecting Assets -> Import Package -> Custom Package, browsing to where you downloaded the TensorFlowSharp plugin file, and clicking Import.

Make sure you save the project: File -> Save Project

Create a New Unity Scene

  • From the main menu select File -> New Scene.
  • Save the scene by clicking File -> Save Scene, naming the scene “WallPong”.
  • Go to the Project Window and in the Assets folder right click and select Create -> Folder.
  • Rename the folder WallPong.
  • Within this new WallPong folder create three other folders.
  • Rename one Materials, one Scripts, and one TF_Files.
  • Move the WallPong scene into the WallPong folder.
  • Your Project Window should look similar to the adjacent figure (if you are using an older version of Unity you may not have a folder called Packages, that is OK).
  • Save the scene [File-> Save Scene] and save the project [File -> Save Project].

Creating the Wall Pong Arena

The Wall Pong game environment or “Arena” we are going to create is going to be as simple as possible. It will contain a bounded game area, a ball and a paddle.

Wall Pong Arena: Parent Game Object

  • Go to the scene’s Hierarchy Window.
  • Within the Hierarchy Window right click and select Create Empty.
  • Rename the game object “WallPongArena”.
  • Set the Position and Rotation transform vectors to (0, 0, 0)
  • Set the Scale vector to (1, 1, 1).
  • You can set the transform’s position, rotation and scale to the above values by entering them manually or by clicking on the small cogwheel icon on the upper far right of the Transform panel (highlighted in red in the adjacent figure) and selecting Reset.

NOTE: the WallPongArena game object will be the “Parent” game object for the Wall Pong Arena. Everything that makes up the Wall Pong game arena (i.e., plane, bounds, paddle, ball) will be a child object of this WallPongArena game object. Later on you will see that by doing this we will be able to create multiple Wall Pong Arenas within the same Unity Scene and, thus, significantly increase the speed of agent learning by training the agent in multiple game arenas simultaneously.

Wall Pong Arena: Plane

  • Return to the Hierarchy Window.
  • With the WallPongArena game object selected, right click and select 3D Object -> Plane.
    • You should now have a Plane game object as a child object of the WallPongArena. 
    • Set the Plane’s Position and Rotation transform vectors to (0, 0, 0)
    • Set the Scale vector to (1, 1, 0.7).
  • Add some color to the Plane, by creating a Material for the Plane object.
    • In the Project Window select the Materials folder we created earlier.
    • Right click on the file/object area of the Project Window and select Create -> Material.
    • Rename the material “Plane”.
    • In the Inspector Window click on the white color patch next to the eye-dropper icon (to the right of Albedo).
    • A color selector window should open up.
    • Enter the color code 2F6234 into the hexadecimal input field.
    • Drag the Plane material from the Project Window onto the Plane object in the Hierarchy Window.
    • After you do this, the Plane object should now be green and if you select the Plane object, the MeshRenderer panel in the Inspector Window should look like the adjacent figure.

 

Wall Pong Arena: Bounds

  • For the bounds (walls) of the game area, we first need to create another empty game object as a child object of the WallPongArena.
  • To do this select the WallPongArena game object in the Hierarchy Window, right click and select Create Empty.
    • Rename the empty game object “Bounds”.
    • Reset the Transform so that the Position and Rotation vectors = (0, 0, 0) and the Scale vector = (1, 1, 1).
  • Create 5 Cube game objects as child objects of the empty game object Bounds.
    • Create each Cube by selecting the Bounds game object in the Hierarchy Window, then right click and selecting 3D Object -> Cube.
    • Rename and set the Transform of each Cube as follows:
      • TopBound [Position (0, 0.25, 3.5), Rotation (0, 0, 0), Scale (10.1, 0.5, 0.1)]; 
      • LeftBound [Position (-5, 0.25, 0), Rotation (0, 0, 0), Scale (0.1, 0.5, 7)]; 
      • RightBound [Position (5, 0.25, 0), Rotation (0, 0, 0), Scale (0.1, 0.5, 7)]; 
      • TopLeftCorner [Position (-4.8, 0.25, 3.39), Rotation (0, 60, 0), Scale (0.1, 0.5, 0.45)];
      • TopRightCorner [Position (4.8, 0.25, 3.39), Rotation (0, -60, 0), Scale (0.1, 0.5, 0.45)].
  • Create a Material for the bounds by selecting the Materials folder in the Project Window.
    • Right click in the file/object area and selecting Create -> Material.
    • Rename the material “Bounds”.
    • In the Inspector Window click on the white color patch next to the eye-dropper icon.
    • In the color selector window enter the color code B6B6B6 into the hexadecimal input field (this should be a grey color).
    • Drag the Bounds material onto each of the 5 child bounds in the Bounds object in the Hierarchy Window (i.e., onto the TopBound, LeftBound,… TopRightCorner objects).

 

Wall Pong Arena: Paddle

  • To create the Paddle, select the WallPongArena object within the Hierarchy Window, right click and select, 3D Object -> Cube.
    • Rename the Cube object “Paddle”.
    • Set the Paddle’s transform Position = (0, 0.25, -3.5), Rotation = (0, 0, 0), and Scale = (1.5, 0.5, 0.2).
  • Create a Material for the Paddle.
    • Select the Materials folder in the Project Window, right click in the file/object area and select Create -> Material.
    • Rename the material “Paddle”.
    • In the Inspector Window click on the white color patch next to the eye-dropper icon.
    • In the color selector window enter the color code 0080FF into the hexadecimal input field (this should be a blue color).
    • Drag the Paddle material from the Project Window onto the Paddle object in the Hierarchy Window.
  • We also need to add some physical movement constraints to the Paddle.
    • This is done by adding a Rigidbody to the Paddle object.
    • Select the Paddle object in the Hierarchy Window.
    • In the Inspector Window, click the Add Component button at the bottom of the window.
    • Select Physics -> Rigidbody.
    • In the Rigidbody panel, check the box next to Is Kinematic.
    • Under the Constraints options of the Rigidbody panel, check the Freeze Position Y and Z check boxes and the Freeze Rotation X, Y and Z check boxes.

 

Wall Pong Arena: Ball

  • To create the Ball, select the WallPongArena object within the Hierarchy Window, right click and select 3D Object -> Sphere.
    • Rename the Sphere object “Ball”.
    • Set the Ball’s transform Position = (0, 0.25, -3), Rotation = (0, 0, 0), and Scale = (0.5, 0.5, 0.5).
  • Create a Material for the Ball.
    • Select the Materials folder in the Project Window, right click in the file/object area and select Create -> Material.
    • Rename the material “Ball”
    • In the Inspector Window click on the white color patch next to the eye-dropper icon.
    • In the color selector window enter the color code FF9900 into the hexadecimal input field (this should be an orange/brown color).
    • Drag the Ball material from the Project Window onto the Ball object in the Hierarchy Window.
  • The Ball also needs a Rigidbody.
    • Select the Ball object in the Hierarchy Window.
    • In the Inspector Window, click the Add Component button at the bottom of the window and select Physics -> Rigidbody.
    • In the Rigidbody panel, set the Mass to 0.01 and turn Use Gravity off (uncheck the Use Gravity check box).
    • Under the Constraints options, check the Freeze Position Y and Freeze Rotation Y check boxes.
  • The Ball also needs a Physics Material so it can bounce off the Bounds and Paddle.
    • To add a Physics Material to the Ball, select the Materials folder in the Project Window, right click in the file/object area and select Create -> Physics Material.
    • Rename the physics material “Ball”.
    • In the Inspector Window, set Dynamic Friction and Static Friction to 0, Bounciness to 1, Friction Combine to Minimum, and Bounce Combine to Maximum.
    • To add the physics material to the Ball make sure the Ball is selected in the Hierarchy Window.
    • Drag the Ball physics material into the Material box in the Sphere Collider panel in the Inspector Window (see below image).

Re-position the Scene’s Main Camera

  • Re-position the scene’s camera by selecting the Main Camera in the Hierarchy Window.
  • In the Inspector Window set its Transform so that the camera is focused on the game area.
  • If you haven’t changed the default Unity settings [i.e., the display settings in the Game Window should be Standalone (1024×768)], then set the camera’s Position = (0, 8, 0) and Rotation = (90, 0, 0).
  • Save the scene [File-> Save Scene].
  • Save the project [File-> Save Project].

At this stage your Unity Editor should look something like this:

Adding the ML-Agent Components

There are three key components that need to be added to a Unity Scene in order to train/use Unity ML-Agents: Agent(s), Brain(s) and an Academy component. As detailed in the ML-Agents documentation:

  • Agents are script components that are assigned to game objects within a Unity Scene. There are three essential functions of an Agent component: (i) collect observations from the environment/scene; (ii) perform actions within the environment/scene; and (iii) assign rewards (positive or negative). Note that you can have more than one Agent within a Unity environment/scene, but each Agent can only have one Brain (see below block diagrams). For Wall Pong, we just need one Agent component assigned (attached) to the Paddle.
  • Brains are the decision making components. A Brain enacts the state-action policy or decision logic for an Agent. In other words, it specifies what action an Agent should take based on the current set of observations and rewards the Agent has received. You can have more than one Brain within a Unity environment/scene and each Brain can be connected to more than one Agent (see below block diagrams). For Wall Pong, we need one Brain component for the one Agent component attached to the Paddle.
  • The Academy component is the interface between the Brain(s) and Agent(s) within the Unity environment/scene and the external Python API that implements the reinforcement, neural network, or machine learning algorithms used for training. It also defines the rendering quality and the simulation timescale (update speed) employed during training and testing. It is important to note that each Unity scene can only have one Academy component.

For more details about the Agents, Brains and the Academy components of the ML-Agents toolkit, see ML-Agents Overview.

 

Setting up the Academy

To add an Academy component to our Wall Pong scene we first have to create a Game Object to hold the Academy component.

  • Go to the scene’s Hierarchy Window.
  • Within the Hierarchy Window, right click and select Create Empty.
  • Rename the game object “Academy”.

The Academy component is a C# Class of the MLAgents namespace. To assign the Academy class to the Academy Game Object we must create a C# script that inherits the Academy component class.

  • Select the Scripts folder in the Project Window,
  • Right click in the file/object area and select Create -> C# Script.
  • Rename the script “WallPongAcademy”
  • Double click the script to open it.
  • After the WallPongAcademy.cs script opens in your code editor (e.g., Visual Studio), delete everything (i.e., the default C# code).
  • Copy and paste the below code.
  • Save the script.
  • Go back to Unity and drag the saved WallPongAcademy C# script onto the Academy object in the Hierarchy Window.
using MLAgents;

public class WallPongAcademy : Academy {

    // for Wall Pong
    // we don't need anything else here

}

NOTE: all ML-Agents components (i.e., Agent, Brain and Academy) are derived from the MLAgents namespace. Thus, to inherit one of these component classes you need to include a “using” reference to the MLAgents namespace.

At this stage you are probably wondering why we are adding this script if we are not adding any code. After you add this script to the scene’s Academy object, however, you will see that, because the WallPongAcademy class is derived from (i.e., inherits) the Academy class, the script provides a host of configuration settings. If you select the Academy object in the Hierarchy Window you will see these settings in the Inspector Panel.

For Wall Pong we can simply use the default Academy settings for both the Training Configuration and for the Inference Configuration.

  • For the Training Configuration, the aim is to maximize simulation speed by reducing the size of the game window for compiled games (i.e., set the window Width and Height to a small value, for instance 80 x 80) and, for compiled games or games played in the Unity Editor, reducing the graphics rendering Quality Level (which should be set to minimal, or 1) and increasing the Time Scale (here, 100 times real time) while setting the Target Frame Rate to -1 (i.e., the fastest possible frame rate). Note that for more elaborate environments or computationally intensive processes you might have to set the Time Scale to a value less than 100 (e.g., 50 or 25).
  • The Inference Configuration reflects the settings you want to use when a player, a trained agent, or a hard-coded AI is playing the game. In this case you want a decent window size (i.e., set the window Width and Height to a larger value, for instance 1280 x 720) and graphics Quality Level (e.g., optimal, or 5), with the Time Scale set to 1 and the Target Frame Rate equal to a standard update rate (e.g., 60 frames per second).

NOTE: although for this simple Wall Pong game we don’t need to add any custom code to the Wall Pong Academy, you often need to customize the Academy for more complex games and environments. Indeed, you can customize your Academy to initialize the environment after a scene loads, reset a Unity scene or environment, and change things in the environment at each time step. For more details about the functions of the ML-Agents Academy see: Design-Academy.
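For reference, the sketch below shows what a customized Academy might look like. It is illustrative only and assumes the v0.5 API, in which the Academy class exposes InitializeAcademy(), AcademyReset() and AcademyStep() as overridable hooks for the initialization, reset and per-step behaviors mentioned above (Wall Pong itself does not need any of them).

using MLAgents;

// Illustrative sketch only - Wall Pong uses the empty WallPongAcademy shown earlier
public class MyCustomAcademy : Academy
{
    public override void InitializeAcademy()
    {
        // Called once when the environment is launched
        // (e.g., cache references or set global physics settings)
    }

    public override void AcademyReset()
    {
        // Called whenever the environment is reset
        // (e.g., re-randomize environment-wide properties)
    }

    public override void AcademyStep()
    {
        // Called at every environment step, before the Agents act
        // (e.g., move obstacles or update global game state)
    }
}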

 

Adding a Brain

Now that we have an Academy, we can add a Brain to the scene. Brains need to be a child object of the scene’s Academy.

  • Select the Academy game object in the Hierarchy Window.
  • Right click and select Create Empty.
  • Rename the child game object “WallPongBrain”.
  • With the WallPongBrain object selected in the Hierarchy Window, click on the Add Component button at the bottom of the Inspector Window. 
  • In the search field at the top of the add component pop-up window type in the word ‘Brain’
  • If you added the ML-Agents assets to the project correctly you should see a “Brain” C# component script appear as a select-able component.
  • Click on the “Brain” C# script component to add it to the WallPongBrain object.
  • NOTE: you can also add the Brain C# component script to the WallPongBrain game object by going to the Assets/ML-Agents/Scripts folder in the Project Window and dragging the Brain C# script onto the WallPongBrain game object in the Hierarchy Window.

Recall that the Brain component specifies what action an Agent should take based on the current state of the environment (i.e., it enacts a control policy). There are three key aspects to the Brain: (1) the type and number of state observations; (2) the type and number of actions; and (3) the type of Brain. If you select the WallPongBrain game object in the Hierarchy Window you should see the relevant parameter settings in the Inspector Window (see below figure).

NOTE: As of ML-Agents v0.5, an ML-Agent Brain using Discrete actions can now have “branches” of concurrent discrete actions. The discrete vector action space, Branches, is an array of integers, with each value corresponding to the number of action possibilities for each branch. This means, for example, if we wanted an Agent that can move in a plane and jump, we could define two branches (one for motion and one for jumping) because we want our agent to be able to move and jump concurrently. For Wall Pong we do not need concurrent actions, so we will just be using one action Branch (see below). Go here for more details.

Observations:

The ML-Agents toolkit lets an Agent observe an environment in two ways: via Vector Observations or via Visual Observations.

  • Vector Observations: correspond to a set of discrete (integer) or continuous (real) values that specify the state of environmental objects and events.
  • Visual Observations: correspond to pixel images generated from the cameras attached to an agent. These pixel images correspond to what the agent can “see” (its point of observation).

Here we are simply going to use continuous Vector Observations. More specifically, the Wall Pong Agent will observe the x (left-right) and z (forward-back) position and velocity of the Ball and the x (left-right) position of the Paddle (i.e., 5 continuous states in total).

Why use Vector Observations rather than Visual Observations? Well, for one, Vector Observations involve significantly less computational cost compared to Visual Observations. In addition, for simple state-action mappings, Vector Observations typically result in much faster training than Visual Observations. More information about ML-Agent Brains and Observations can be found here [Design-Brains] and here [ML-Agents Overview].

To set the Brain observation parameters:

  • Select the Brain object in the Hierarchy Window.
  • In the Inspector Panel set the Space Type to Continuous.
  • Set the Space Size to 5.
  • Leave Stacked Vectors at the default setting of 1 (go here for information about this parameter).
Actions:

The possible actions an agent can take within a given environment or game are represented as a vector of either Discrete (integer) or Continuous (real) numerical values. Discrete actions correspond to categorically distinct action events (e.g., fire, jump, move left one step), whereas Continuous action values represent a real value scale of some action parameter (e.g., joint-angle, movement velocity or force, motor torque).

As noted above, for Wall Pong there are four possible discrete actions that an agent or player can make: do nothing (don’t move); move paddle left (translate left by a fixed step size); move paddle right (translate right by a fixed step size); and fire ball (i.e., start game). These discrete actions will be represented by the integer values 0, 1, 2, and 3, respectively. To set these action parameters:

  • Select the Brain object in the Hierarchy Window.
  • In the Inspector Panel set the Vector Action – Space Type to Discrete.
  • Set the Vector Action Branches Size to 1.
  • Set the Branch 0 Size to 4.
  • (OPTIONAL) Expand the Branch Descriptions setting; its Size should already be set to 1. Enter a brief description of what each action (element) value corresponds to in the Element 0 input field: 0 = do nothing, 1 = move left, 2 =  move right, 3 = fire ball.
Brain Type:

A Brain can control the behavior of an agent in four different ways. These four ways or “Types” of Brains are as follows:

  • External: Used during agent training. Actions are decided by the external training or learning process defined in the Python API. For instance, the PPO training process, which is the default training process in the ML-Agents toolkit (more on this later).
  • Internal: Used to test agents, post training. Actions are decided using a TensorFlowSharp model (i.e., a trained artificial neural-network model).
  • Player: Used for game testing and (human) user game play. Actions are decided using keyboard input mappings.
  • Heuristic: Actions are decided using a hard-coded policy or custom AI script.

Over the course of this tutorial we will use the first three types of brains (External, Internal and Player). At this stage we are going to set the Brain Type to Player so that once we have the Agent component and ball controller set up (see below) we can test the Wall Pong game using keyboard inputs to control the Paddle. Later on we will set the Brain Type to External so we can train an artificial neural network (TensorFlowSharp model) to control the actions of the Paddle. Finally, we will test a trained TensorFlowSharp model by importing the model into Unity and setting the Brain Type to Internal.

To set the Brain Type to Player:

  • Select the Brain object in the Hierarchy Window.
  • In the Inspector Panel set the Brain Type drop-down list to Player.
  • Set the Default Action to 0 (i.e., do nothing).
  • Expand the Discrete Player Action settings and set the Size field to 3.
  • For Element 0, set the Key to ‘A’ and the Value to 1 (i.e., move left).
  • For Element 1, set the Key to ‘D’ and the Value to 2 (i.e., move right).
  • For Element 2, set the Key to ‘Space’ and the Value to 3 (i.e., fire ball).

 

Creating and Adding the Agent Component

As described above, the Agent component is a custom script that inherits the Agent class from the ML-Agents namespace and is assigned to an ‘Agent’ game object within a Unity Scene. There are three essential functions of the Agent component: (i) collect observations from the environment; (ii) perform actions within the environment; and (iii) assign rewards (positive or negative) when appropriate.

For Wall Pong, the Agent game object is the Paddle. Hence, we need to write a custom Agent script and add it as a component to the Paddle.

  • Select the Scripts folder in the Project Window,
  • Right click in the file/object area and select Create -> C# Script.
  • Rename the script “WallPongPaddleAgent”
  • Double click the script to open it.
  • After the WallPongPaddleAgent.cs script opens in your code editor delete everything (i.e., the default C# code).
  • Copy and paste the below code template (we will complete each element and method one by one below).
  • Save the file.
using UnityEngine;
using MLAgents;

public class WallPongPaddleAgent : Agent
{
    // Paddle (Agent) Variables

    // Ball Variables

    // Normalization vectors for observations

    void Start()
    {
        //code will be added here
    }

    public override void CollectObservations()
    {
        //code will be added here
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        //code will be added here
    }

    public override void AgentReset()
    {
        //code will be added here
    }

    void OnCollisionEnter(Collision col)
    {
        //code will be added here
    }
}

 

Note that the above template script defines a public class WallPongPaddleAgent that is derived from the Agent class. Thus, even without having to write any custom code the WallPongPaddleAgent script already inherits (contains) a host of relevant parameters and functionality. To see the derived functionality, go back to Unity:

  • In the Project Window, navigate to the Assets/WallPong/Scripts folder.
  • From the File area of the Project Window, select WallPongPaddleAgent.cs file and drag it onto the Paddle game object in the Hierarchy Window.
  • If you select the Paddle game object in the Hierarchy Window the WallPongPaddleAgent (script) component should now appear in the Inspector Panel.

NOTE: because we are not using Visual Observations we can ignore the Agent Cameras settings here (i.e., we don’t need to add/remove cameras). We will not be using On Demand Decisions either, so we can also ignore this setting (leave it unchecked). Go here for information about Agent Cameras and On Demand Decisions.

Before we customize our WallPongPaddleAgent (script) component, let’s go ahead and set up the inherited parameters.

  • The first thing we need to do is to assign a Brain to the WallPongPaddleAgent component.
  • With the Paddle selected in the Hierarchy Window drag the WallPongBrain object into the Brain field in the Inspector Window.

  • Next, set the Max Step value. Max Step defines the number of time steps before an Agent calls its reset function. It can be used to define the maximum length of a game episode for the Agent. If set to 0, the agent will play forever, unless a reset is initiated from the Academy (we do not review Academy resets here).
  • Set the value of Max Step to 3600 (approximately 1 minute of real-time game play).
  • Turn Reset On Done on by checking the Reset On Done checkbox.
  • By checking Reset On Done we are stating that if a game episode ends (i.e., if done = true) before Max Step, then the WallPongPaddleAgent will call its reset function. Here, this means that if the Paddle misses the Ball a game episode will be considered done and the Ball will be reset for another round of game play.
  • The final parameter to set is the Decision Frequency. This defines the number of time-steps between action decisions. It is often better to set this parameter to a value greater than 1. That is, having your agent be too responsive to changes in the environment can sometimes hinder learning. The opposite is also true, however; you don’t want your agent to be too unresponsive either.
  • For Wall Pong, setting the Decision Frequency to 3 should work well.
  • NOTE: setting the Decision Frequency > 1 will result in the agent performing the same action at each time-step between decisions.

 

Customizing the Agent Component (Script)

The first thing we need to add to our WallPongPaddleAgent script are the variables we need for the Paddle (agent) and the Ball. The required variables are listed in the below code snippet. Copy and paste these into the WallPongPaddleAgent script.

// Paddle (Agent) Variables
private Transform trPaddle;
private Rigidbody rbPaddle;
public float translateFactor = 0.1f;
private float maxPaddlePosition;

// Ball Variables
public GameObject Ball;
private Transform trBall;
private Rigidbody rbBall;
  • GameObject Ball: this public variable is included so that the WallPongPaddleAgent script will have a reference to the Ball game object.
  • The private Transform and Rigidbody variables for the Paddle and Ball (i.e., trPaddle, trBall, rbPaddle, rbBall) are simply included to streamline the code when referencing the Paddle and Ball‘s position and velocity.
  • translateFactor: will define how far the Paddle will translate left or right when the move-left or move-right actions are called.
  • maxPaddlePosition: this private variable will be used to set the maximum possible position that the Paddle can move. That is, this variable will ensure that the Paddle stays within the bounds of the game arena.

Most machine-learning or reinforcement-learning methods used for AI training work best when environmental states or observations are normalized (i.e., re-scaled to [0, 1] or [-1, 1]). This is true for the default ML-Agents training process (i.e., PPO) that we will use here. Thus, we also need to add some normalization variables to scale the max and min position and velocity of the Paddle and Ball.

// Normalization for observations
public Vector3 normPosFactor = new Vector3(5.0f, 1.0f, 3.5f);
public float normVelFactor = 5.0f;

  • normPosFactor: this Vector3 will be used to normalize the Paddle’s x position observations and Ball’s x and z position observations. The floating point values of 5.0 and 3.5 correspond to the absolute maximum x and z position of the game arena, respectively.
  • normVelFactor: this float will be used to normalize the Ball’s x and z velocity observation.

Now that we have defined the variables we need, let’s add a Start() method to assign the Transform and Rigidbody variables and set maxPaddlePosition when the Unity game starts.

void Start()
{
    // Assign paddle transform and rigidbody variables
    trPaddle = this.transform;
    rbPaddle = this.GetComponent<Rigidbody>();

    // Assign ball transform and rigidbody variables
    trBall = Ball.transform;
    rbBall = Ball.GetComponent<Rigidbody>();

    // Set Paddle and Ball Constraints
    // Equals width of plane arena (from center) minus the width (from center) of the paddle
    maxPaddlePosition = normPosFactor.x - (trPaddle.localScale.x / 2);
}

 

Customizing the CollectObservations() and AgentAction() Methods

The two central methods to any Agent component are the CollectObservations() and the AgentAction() methods. These methods are derived from the Agent class and we need to override (i.e., modify) them to meet the specific observation and action requirements of the particular environment or game being created. A detailed description of the CollectObservations() and the AgentAction() methods is provided here [ Design-Agents ].

CollectObservations():

As the name indicates, the CollectObservations() method is where we define the environmental or object states the agent will observe. The method is called at each time-step of game play and it is essential that the order of state observations is always the same (i.e. fixed). Given that we are using Vector Observations here, we define or add the relevant state observations within CollectObservations() using the AddVectorObs() method.

public override void CollectObservations()
{
    // Collect current ball x and z position and velocity
    AddVectorObs(trBall.localPosition.x / normPosFactor.x);
    AddVectorObs(trBall.localPosition.z / normPosFactor.z);
    AddVectorObs(rbBall.velocity.x / normVelFactor);
    AddVectorObs(rbBall.velocity.z / normVelFactor);

    // Collect current paddle (agent) x position
    AddVectorObs(trPaddle.localPosition.x / normPosFactor.x);
}

As can be easily discerned from the above code snippet, we add each observed state one after the other. First the Ball’s local x and z position, then the Ball’s x and z velocity, and finally the Paddle’s local x position (note that the paddle agent does not need to observe the Paddle’s z position or its x or z velocity because the Paddle only translates along the x-axis). In each case, we normalize the value of each state by dividing it by the corresponding normalization factor.

NOTE that we used the localPosition of the Ball and Paddle for the state observations. This is essential if we want to use multiple game arenas for training. That is, we want the WallPongPaddleAgent to observe the state of the Paddle and Ball relative to its own local game arena.

AgentAction():

The AgentAction() method is where we detail the specific actions an agent will take. Recall that when we set up the WallPongBrain, we specified that there would be four discrete actions [do nothing, move left, move right, and fire ball] and that these actions would be represented by the integer values 0, 1, 2 and 3, respectively. Thus, when the Academy invokes the AgentAction() method, the Brain specifies which action (or actions) the agent should perform by passing an action parameter array, vectorAction, to the Agent component. When using a continuous action space, vectorAction is an array with length equal to the size of the continuous action space. For a discrete action space, vectorAction contains one value per action branch (here, a single value), with this value indexing which action should be enacted (e.g., 0 for do nothing, 1 for move left, etc.).

In the below code snippet, you can see that because we are using a discrete action space for our Wall Pong game, we can convert the single value in the parameter vectorAction to an integer index and then use a switch statement to specify which action to enact. Note that for the move left and move right actions, we first specify the direction to go (i.e., using the Vector3 dirToGo) and then, before translating the Paddle left or right, the new position of the Paddle is clamped to + or - maxPaddlePosition, ensuring the Paddle does not move outside the bounds of the game arena.

With regard to firing the ball (i.e., vectorAction[0] = 3), later on in this tutorial we will write a C# script to control the Ball called WallPongBallController.cs. Within this script we will define the FireBall() method. This method will set the ball in motion by adding a velocity force to the ball.

At the top of the method, an if statement is included to check whether the ball is still in play. If the Paddle has missed the Ball (i.e., the z position of the ball is less than the z position of the Paddle), the Done() method is called to end the episode and a negative (-1) reward is passed to the Agent using the AddReward() method. Note that the Done() and AddReward() methods are derived from the Agent class.

public override void AgentAction(float[] vectorAction, string textAction)
{
    // If ball missed, add negative reward
    if (trBall.localPosition.z < trPaddle.localPosition.z)
    {
        Done();
        AddReward(-1.0f);
    }

    // Act (0=nothing, 1=move left, 2=move right)
    int action = Mathf.FloorToInt(vectorAction[0]);
    Vector3 dirToGo = Vector3.zero;
    switch (action)
    {
        case 0:
            //Do nothing
            break;
        case 1:
            //Set translate direction to left
            dirToGo = Vector3.left;
            break;
        case 2:
            //Set translate direction to right
            dirToGo = Vector3.right;
            break;
        case 3:
            //Fire ball
            Ball.GetComponent<WallPongBallController>().FireBall();
            break;
    }

    // Update (move) Paddle position, clamping paddle position to max paddle position (arena bounds)
    Vector3 newPosition = trPaddle.localPosition + (dirToGo * translateFactor);
    newPosition.x = Mathf.Clamp(newPosition.x, -maxPaddlePosition, maxPaddlePosition);
    trPaddle.localPosition = newPosition;
}

 

Adding a Reset() Method and a Positive Reward Function

Recall that when we set the initial Agent component settings we checked Reset On Done = true. Thus, when the Done() method is called in the AgentAction() method (when the Paddle misses the Ball), the Agent component’s AgentReset() method is invoked. This method is also derived from the Agent component class and should be overridden to complete any necessary agent reset tasks when either Done() is called or the parameter value of Max Step is reached.

For our Wall Pong game, the only thing we need to reset when the Paddle misses the Ball, is the Ball itself. As noted above, in the next section of this tutorial we will write a C# script to control the Ball called WallPongBallController.cs. In addition to defining a FireBall() method within that script, we will also define a ResetBall() method, which will position the Ball just above the Paddle, ready to be fired. For now, we will assume that this method already exists and will implement the AgentReset() method as follows:

public override void AgentReset()
{
    //Leave the paddle (agent) where it is
    //Reset ball position and velocity
    Ball.GetComponent<WallPongBallController>().ResetBall();
}

Finally, given that the aim of the game is to intercept the Ball with the Paddle in order to keep the Ball in play for as long as possible we need to add a positive reward every time the Ball hits the Paddle. The easiest way to do this is to use Unity’s OnCollisionEnter() method, adding a positive (+1) reward whenever the Ball collides with the Paddle.

void OnCollisionEnter(Collision col)
{
    // If paddle (agent) hits ball, add positive reward
    if (col.gameObject.name == "Ball")
    {
        AddReward(1.0f);
    }
}

Creating and Adding the Ball Controller Script

Our game is almost ready; the last thing to do is to write a Ball controller script.

  • Select the Scripts folder in the Project Window.
  • Right click in the file/object area and select Create -> C# Script.
  • Rename the script “WallPongBallController”
  • Double click the script to open it.
  • After the WallPongBallController.cs script opens in your code editor delete everything (i.e., the default C# code).
  • Copy and paste the below code template (we will complete each element and method one by one below).
  • Save the file.
using UnityEngine;
public class WallPongBallController : MonoBehaviour
{
    // Ball Variables

    // Paddle Variables

    // Ball variables

    void Start()
    {
        //code will be added here
    }

    // Reset ball position in reference to paddle location
    public void ResetBall()
    {
        //code will be added here
    }

    // Use FixedUpdate to track Paddle position if Ball not fired
    private void FixedUpdate()
    {
        //code will be added here
    }

    // Fire ball to start moving 
    public void FireBall() 
    { 
        //code will be added here 
    }

    // If ball hit paddle, set new ball direction
    float hitFactor(Vector3 ballPos, Vector3 paddlePos, float paddleSize)
    {
        //code will be added here
        return 0f; // placeholder return so the template compiles
    }

    // Process Paddle collision
    void OnCollisionEnter(Collision col)
    {
        //code will be added here
    }
}

The essential functionality of this script is threefold: (1) set or reset the position of the Ball above the Paddle, ready to fire; (2) fire the Ball, i.e., set the Ball in motion; and (3) determine the direction that the Ball bounces off the Paddle as a function of where the Ball hits the Paddle. Before adding this functionality, let’s first define the various script parameters and variables.

// Ball Variables
private Transform trBall;
private Rigidbody rbBall;

// Paddle Variables
public GameObject paddle;
private Transform trPaddle;
public float paddleOffSet = 0.5f;

// Ball variables
public float ballSpeed = 4;
private bool ballFired = false;

The reason for most of these variables should be relatively obvious. The Transform and Rigidbody variables will be used to streamline references to the position and velocity of the Ball and the position of the Paddle. The public GameObject paddle variable is needed so the WallPongBallController script will have a reference to the Paddle game object. The paddleOffSet parameter is the distance from the Paddle that the Ball will be positioned for firing. ballSpeed will be used to set the speed or velocity of the ball when fired. Finally, ballFired will be used to keep track of whether the Ball has been fired and is in play, or whether the Ball is set and ready to be fired.

Now that we have defined the script variables, let’s use the Start() method to assign the Transform and Rigidbody variables for the Ball and Paddle accordingly.

void Start()
{
    // Assign ball transform & rigidbody variables
    trBall = this.transform;
    rbBall = this.GetComponent<Rigidbody>();

    // Assign Paddle transform variable
    trPaddle = paddle.transform;
}

 

Setting or resetting the Ball’s position involves positioning the Ball just above the Paddle at the Paddle’s current local x position. Until the Ball is fired (set in motion) and ballFired = true, the Ball also needs to update its local x position so it moves with the Paddle. That is, if the agent moves the Paddle before the Ball is fired the Ball needs to move with the Paddle. We will implement this across two methods.

First, in the ResetBall() method we will set ballFired = false and make sure the Ball’s velocity is equal to zero.

// Reset ball position in reference to paddle location
public void ResetBall()
{
    //Reset ball fired state
    ballFired = false;

    // Initial velocity equals zero
    rbBall.velocity = Vector3.zero;
}

Then, in the FixedUpdate() method, if ballFired = false, we will constantly set or reset the Ball’s position just above the middle of the Paddle’s current local x position. FixedUpdate() is automatically invoked by Unity at each fixed time-step of game play and, thus, if the ball is not in play (i.e., not fired), using FixedUpdate() ensures that the Ball automatically tracks any change in the Paddle’s position (agent actions are also updated at Unity’s fixed update rate).

private void FixedUpdate()
{
    // If ball not fired (game not on)
    // then ball tracks paddle position
    if (ballFired == false)
    {
        // Set initial ball position
        trBall.localPosition = new Vector3(trPaddle.localPosition.x, (trBall.localScale.y / 2), trPaddle.localPosition.z + paddleOffSet);       

        // Add small penalty for not firing the ball
        paddle.GetComponent<WallPongPaddleAgent>().AddReward(-0.01f);
    }
}

Finally, it is important to note that we also add a small negative (-0.01) reward for each time step the Ball is not in play. Without this small negative reward, it is possible that an agent could learn to never fire the ball. This is because during early learning an agent might come to associate firing the ball with missing the ball (a reward of -1); thus, never firing the ball would yield a more positive reward (a reward of 0) than firing it. By adding a small penalty for not firing the ball we ensure that the agent will learn to fire the Ball and increase the likelihood that the agent will also learn that it can maximize its total or expected reward by hitting the Ball with the Paddle (i.e., keeping the Ball in play).

The FireBall() method simply needs to set the ball in motion by changing its velocity. To make the game a little more challenging, we coded the method so that it not only fires the ball away from the Paddle in the +z direction, but also adds a random amount of velocity in the +x or -x direction. This is achieved by first creating a normalized velocityDirection Vector3, with the z component equal to 1 and the x component equal to a random value between -1 (left) and 1 (right). We then set the Ball’s Rigidbody velocity equal to this velocityDirection vector multiplied by the ballSpeed parameter. Finally, the method sets ballFired = true.

// Fire ball to start moving
public void FireBall()
{
    if (ballFired == false)
    {
        // Calculate initial direction, make length=1 via .normalized
        Vector3 velocityDirection = new Vector3(Random.Range(-1.0f, 1.0f), 0, 1).normalized;

        // Initial Velocity
        rbBall.velocity = velocityDirection * ballSpeed;

        //Set ball fired
        ballFired = true;
    } 
}

 

The last two methods in the WallPongBallController script are used to determine the direction the Ball bounces off the Paddle as a function of where the Ball hits the Paddle. In short, when the Ball collides with the Paddle, the OnCollisionEnter() method is invoked. This method then calls the hitFactor() method to determine the x velocity component of the Ball’s bounce direction: when the Ball hits the right side of the Paddle, x will equal a positive value between 0 and 1, setting a right-bounce direction; when the Ball hits the left side of the Paddle, x will equal a negative value between 0 and -1, setting a left-bounce direction; and when the Ball hits the center of the Paddle, x = 0, such that the Ball will bounce directly away from the Paddle.

After determining the x bounce direction using the hitFactor() method, a velocityDirection vector is then initialized and normalized with the calculated x direction value and z = 1. Finally, the Ball’s Rigidbody velocity is set to this velocityDirection vector multiplied by the ballSpeed parameter.

// Process Paddle collision
void OnCollisionEnter(Collision col)
{
    // Hit the paddle
    if (col.gameObject.name == "Paddle")
    {
        // Calculate hit Factor as defined in below function
        float x = hitFactor(trBall.localPosition,
            col.transform.localPosition,
            col.collider.bounds.size.x);

        // Calculate direction, make length=1 via .normalized
        Vector3 velocityDirection = new Vector3(x, 0, 1).normalized;

        // Set Velocity with dir * speed
        rbBall.velocity = velocityDirection * ballSpeed;
    }
}

// If ball hit paddle, set new ball direction
float hitFactor(Vector3 ballPos, Vector3 paddlePos, float paddleSize)
{
    // ||  1 <- at the right of the paddle
    // ||
    // ||  0 <- at the middle of the paddle
    // ||
    // || -1 <- at the left of the paddle
    return Mathf.Clamp(((ballPos.x - paddlePos.x) / paddleSize), -1.0f, 1.0f);
}

Finalizing and Testing the Game (make sure it works!)

  • Make sure you save the WallPongPaddleAgent.cs and WallPongBallController.cs scripts.
  • Within the Unity editor, drag the WallPongBallController.cs script from the Scripts folder in the Project Window onto the Ball game object in the Hierarchy Window.
  • Select the Ball game object in the Hierarchy Window.
  • Assign the Paddle to the WallPongBallController by dragging the Paddle game object in the Hierarchy Window into the WallPongBallController‘s Paddle input field in the Inspector Window.
  • Select the Paddle in the Hierarchy Window.
  • Assign the Ball to the WallPongPaddleAgent by dragging the Ball game object in the Hierarchy Window into the WallPongPaddleAgent‘s Ball input field in the Inspector Window.
  • Save the scene (File -> Save Scene).
  • Save the project (File -> Save Project).

Play Wall Pong Yourself (as a player)

Recall that when we set up the WallPongBrain, we set the Brain Type = Player. We did this so that we could test the game using the keyboard inputs “A”, “D” and “Space” for move left, move right and fire ball, respectively. So let’s make sure everything works the way we expect: press the game play button at the top of the Unity Editor and then, when the game starts in the Game Window, play a few rounds using the “A”, “D” and “Space” keys. Note that you can always reduce the ballSpeed parameter to make the game easier.

Getting the Scene Ready for Reinforcement Learning (RL) Training

We are now ready to train an artificial (neural-network) Agent to play our Wall Pong game using the ML-Agents PPO RL process. Before we train an agent, however, we need to modify our game scene to enable and optimize AI training.

  • First, select the WallPongBrain in the Hierarchy Window and set the Brain Type to External. This essentially specifies that the Brain will be controlled by the external Python API.
  • Next, create a Prefab of the WallPongArena, by selecting the WallPongArena game object in the Hierarchy Window and dragging it down into the WallPong folder in the Project Window.
  • Now that the WallPongArena is a Unity Prefab, we can add several WallPongArenas to the scene to dramatically speed up training.
  • Go back to the Hierarchy Window and rename the original WallPongArena to WallPongArena (1).
  • Duplicate the WallPongArena (1) 8 times using the keyboard command Ctrl-D (Cmd-D on a Mac). This should result in a total of nine WallPongArenas, named WallPongArena (1), WallPongArena (2), WallPongArena (3), …, WallPongArena (9).
  • Set the position transform (location) of each arena as follows (or, optionally, use the helper script sketched after this list):
  • WallPongArena (1): x = 0, y = 0, z = 0
  • WallPongArena (2): x = -12, y = 0, z = 0
  • WallPongArena (3): x = 12, y = 0, z = 0
  • WallPongArena (4): x = 0, y = 0, z = -12
  • WallPongArena (5): x =-12, y = 0, z = -12
  • WallPongArena (6): x = 12, y = 0, z = -12
  • WallPongArena (7): x = 0, y = 0, z = 12
  • WallPongArena (8): x = -12, y = 0, z = 12
  • WallPongArena (9): x = 12, y = 0, z = 12
  • Save the scene (File -> Save Scene).
  • Save the project (File -> Save Project).
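NOTE (optional): if you would rather not position each arena by hand, the hypothetical helper script below (not part of the ML-Agents toolkit) lays the nine arenas out on the same 3 x 3 grid with 12-unit spacing. Attach it to an empty game object, drag the nine WallPongArena objects into its Arenas field, and run Arrange Arenas from the component’s context (cogwheel) menu.

using UnityEngine;

// Hypothetical helper: arranges the WallPongArenas on a 3 x 3 grid (12-unit spacing)
public class WallPongArenaLayout : MonoBehaviour
{
    // Drag the nine WallPongArena objects into this array in the Inspector
    public Transform[] arenas;

    // Distance (in world units) between neighboring arenas
    public float spacing = 12.0f;

    [ContextMenu("Arrange Arenas")]
    public void ArrangeArenas()
    {
        for (int i = 0; i < arenas.Length; i++)
        {
            // Map index 0..8 to grid coordinates -1, 0, 1 on x and z
            int col = (i % 3) - 1;
            int row = (i / 3) - 1;
            arenas[i].position = new Vector3(col * spacing, 0f, row * spacing);
        }
    }
}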

If you added everything correctly, your Unity Editor should now look something like this:

NOTE: Why add multiple Wall Pong Arenas? Well, by having multiple Wall Pong arenas the Agent is able to learn from multiple game episodes simultaneously. Thus, our WallPongAgent “experiences” more at each time step of training and learns faster. Keep in mind that there is always a trade-off between adding more game/environment prefabs (i.e., Wall Pong Arenas in this case) and simulation speed. We chose to add nine WallPongArenas here because this significantly speeds up training without any noticeable impact on simulation speed.

Agent Training Using PPO

We will use the default PPO process that comes with the ML-Agents toolkit to train our Wall Pong agent. Proximal Policy Optimization or PPO is a state-of-the-art RL algorithm that uses an artificial neural network to approximate the optimal state-action policy. It is the default RL method for ML-Agents (as well as OpenAI) and can be employed for RL tasks that involve discrete or continuous action spaces. A detailed discussion of the PPO algorithm is beyond the scope of this tutorial. However, if you are interested in learning more about PPO we encourage you to visit the following:

For ML-Agents, the PPO algorithm is implemented using TensorFlow. The learning process is run via the Python API, which can communicate with either a compiled Unity application or the Unity Editor. This Python <-> Unity process means that you can use the provided PPO algorithm for RL, as well as write your own custom RL or ML algorithms. You can also set up several worker-ids or virtual ports/sockets to train multiple (compiled) Unity applications/environments simultaneously. We do not detail these more advanced possibilities here, but more information can be found at Training ML-Agents.

Setting the Training Hyperparameters

Before executing the ML-Agents PPO training process, we first need to set the training hyperparameters in the trainer_config.yaml file. These hyperparameters include the size (number of nodes and layers) of the artificial neural network that will be employed for training, the size of the memory buffer used for training, the training batch size, the learning rate, the max number of training steps, etc. A detailed description of the hyperparameters that can be set in the ML-Agents trainer_config.yaml file can be found here: PPO Hyperparameters.

The ML-Agents trainer_config.yaml file actually has a default set of hyperparameters that can be used ‘out-of-the-box’, so to speak, but it is always better to create your own brain/agent-specific set of hyperparameters within the trainer_config.yaml file and modify these, rather than the default settings, to increase or stabilize training performance. To create a set of WallPongBrain hyperparameters:

  • Open up a file explorer window (a finder window on Mac) and navigate to where you saved the ML-Agents toolkit.
  • Find the trainer_config.yaml file in the yourpath/ml-agents/python folder.
  • Open the trainer_config.yaml file using a script or text editor (e.g., Visual Studio).
  • At the top of the file you will find the default hyperparameter list.
  • Either directly under this default list or at the end of the file, create a new list called WallPongBrain.
  • Note that the name of the hyperparameters list needs to match the Brain name used in the Unity Environment.
  • Enter the hyperparameter names and settings described below and illustrated in the adjacent figure.
  • Save the file.
  • (OPTIONAL): As of ML-Agents v0.5 you can also create and save your own config files. For example, if you download the Adventures in Unity ML-Agents Examples Package from our Github page, you will see that the repository includes a config file called aiumla_config.yaml. This file contains all of the hyperparameter settings for the included brains (i.e., catchball, wallpong, pong).

Given that we are using the standard PPO algorithm, with Vector Observations and a fully connected, dense neural network, we only need to specify the following hyperparameters (note that the descriptions and recommended settings below are from the Unity ML-Agents documentation on PPO Hyperparameters). A sketch of the resulting WallPongBrain entry in trainer_config.yaml appears after this list.

  • batch_size – is the number of experiences used for one iteration of a gradient descent update. This should always be a fraction of buffer_size. If you are using a continuous action space, this value should be large (on the order of 1000s). If you are using a discrete action space, this value should be smaller (on the order of 10s or 100s). Typical Range for discrete action spaces: 32 to 512. Typical Range for continuous action spaces: 512 to 5120. Here we are using a discrete action space and will set batch_size = 64.
  • buffer_size – corresponds to how many experiences (agent observations, actions and rewards obtained) should be collected (“remembered”) before we do any learning or updating of the model. This should be a multiple of batch_size and scaled as a function of the number of steps (i.e., max_steps) required for training. Typically, a larger buffer_size relative to max_steps corresponds to more stable training updates. Typical Range: 2048 to 409600. Here we will set buffer_size = 10240 (also the default setting).
  • beta – corresponds to the strength of the entropy regularization, which makes the policy “more random.” This ensures that agents properly explore the action space during training. Increasing this will ensure more random actions are taken. This should be adjusted such that the entropy (measurable from TensorBoard) slowly decreases alongside increases in reward. If entropy drops too quickly, increase beta. If entropy drops too slowly, decrease beta. Typical Range: 1e-4 to 1e-2. Here we will set beta = 5e-3 (also the default setting).
  • epsilon – corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. Setting this value small will result in more stable updates, but will also slow the training process. Typical Range: 0.1 to 0.3. Here we will leave epsilon = 0.2 (also the default setting).
  • gamma – corresponds to the discount factor for future rewards. This can be thought of as how far into the future the agent should care about possible rewards. In situations when the agent should be acting in the present in order to prepare for rewards in the distant future, this value should be larger. In cases when rewards are more immediate, it can be smaller. Typical Range: 0.8 to 0.995. Here we will set gamma = 0.99 (also the default setting).
  • lambda – is the parameter used when calculating the Generalized Advantage Estimate. This can be thought of as how much the agent relies on its current value estimate when calculating an updated value estimate. Low values correspond to relying more on the current value estimate (which can be high bias), and high values correspond to relying more on the actual rewards received in the environment (which can be high variance). The parameter provides a trade-off between the two, and the right value can lead to a more stable training process. Typical Range: 0.9 to 0.95. Here we will set lambda = 0.95 (also the default setting).
  • learning_rate – corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable and the reward does not consistently increase. Typical Range: 1e-5 to 1e-3. Here we will set learning_rate = 1e-3.
  • max_steps – corresponds to how many steps of the simulation (multiplied by frame-skip) are run during the training process. This value should be increased for more complex problems. Typical Range: 1e5 to 1e7. Here we will set max_steps = 1e5 (100,000 steps).
  • num_epoch – is the number of passes through the experience buffer during gradient descent. The larger the batch_size the larger it is acceptable to make this. Decreasing this will ensure more stable updates, at the cost of slower learning. Typical Range: 3 to 10. Here we will set num_epoch = 3 (also the default setting).
  • hidden_units – correspond to how many units are in each fully connected layer of the neural network. For simple problems where the correct action is a straightforward combination of the observation inputs, this should be small. For problems where the action is a very complex interaction between the observation variables, this should be larger. Typical Range: 32 to 512. Because this is a simple task, with a small number of inputs and outputs, we will set hidden_units = 64
  • num_layers – corresponds to how many hidden layers are present after the observation input, or after the CNN encoding of the visual observation. For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems. Typical range: 1 to 3. Here we will set num_layers = 2 (also the default setting).
  • summary_freq – the frequency, in steps, at which the agent’s progress (in terms of average score/reward received) is posted to the command window during training. This should be set larger when max_steps is larger or when environments include sparse rewards. Typical Range: 1000 to 20000. Here we will set summary_freq = 2000.
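Putting these settings together, the WallPongBrain entry in trainer_config.yaml should look something like the sketch below. Any hyperparameter you leave out simply falls back to the default values at the top of the file, and note that the lambda parameter is spelled lambd in the config file, as in the default entries.

WallPongBrain:
    batch_size: 64
    buffer_size: 10240
    beta: 5.0e-3
    epsilon: 0.2
    gamma: 0.99
    lambd: 0.95          # spelled lambd (not lambda) in the config file
    learning_rate: 1.0e-3
    max_steps: 1.0e5
    num_epoch: 3
    hidden_units: 64
    num_layers: 2
    summary_freq: 2000

With these values, the agent collects 10240 experiences before each policy update, and each update makes num_epoch = 3 passes over that buffer in mini-batches of 64 experiences (roughly 3 × 10240 / 64 = 480 gradient steps per update).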

Executing the Training Process

To run the PPO training process, we need to run the ML-Agents training command mlagents-learn from a command prompt. As detailed at the beginning of this tutorial, here we assume that you have installed the ml-agents toolkit in a virtual (conda) environment called “ml-agents” using Anaconda. Although you can run the training process using a compiled (standalone) Unity application, which is beneficial if you want to run multiple training instances simultaneously, for the sake of simplicity we will complete the training process here by playing the game directly in the Unity Editor.

  • Open up an Anaconda Prompt.
  • If you installed ML-Agents in a virtual environment, activate that virtual (conda) environment; e.g., enter the command: conda activate ml-agents
  • Change the directory to the ml-agents directory; e.g., enter the command: cd yourpath/ml-agents
  • To start the learning/training process, we will use the mlagents-learn command.
  • We include the --train option to specify that we will be training an agent.
  • We will also set a --run-id so that the TensorFlow checkpoint, summary, and model files generated during training have a unique identifier in their file names. Using a different run-id for each run ensures you don’t overwrite previous training files.
  • Enter the following command to start the training process: mlagents-learn config/trainer_config.yaml --run-id=wpT1 --train and press ENTER. (Optional) If you created your own config file, use the “path/name” of that file instead of config/trainer_config.yaml.
  • The full command entry is shown below; after you run it, Unity configuration information should appear in the Anaconda prompt.
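For example, assuming you created a conda environment named “ml-agents” and cloned the toolkit to yourpath/ml-agents, the complete sequence of commands is something like:

  conda activate ml-agents
  cd yourpath/ml-agents
  mlagents-learn config/trainer_config.yaml --run-id=wpT1 --train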

  • Note the instructions that appear in the last line of the window: Start training by pressing the Play button in the Unity Editor. 
  • Thus, to get things started, go back to the Unity Editor and press the Play button.
  • Once the game starts you should see the Agent Paddle begin to learn!
  • Agent progress, that is, the Agent’s average score at each summary frequency update (i.e., every 2000 steps), will be displayed in the Anaconda prompt.
  • Although we are training the agent for 100,000 steps (1e5 steps), the agent should learn to play Wall Pong (i.e., solve the task and reach the maximum reward) within 30,000 to 50,000 steps.


Importing and Testing a Trained Model

After training ends, the final step is to test the trained model in Unity. To do this, we first need to import the trained model into our Unity project.

  • Open up a file explorer window (a Finder window on a Mac) and navigate to where you saved the ML-Agents toolkit.
  • In the yourpath/ml-agents/python/models/ directory you should find a folder with the run-id name you specified when you executed training. For example, if you set run-id=wpT1, the folder will be named wpT1.
  • Open the directory (folder) and find the bytes file with the corresponding run-id name at the end of the filename. Because we trained the agent directly in the Unity Editor, the file should have a name like “editor_Academy_wpT1.bytes”.
  • Copy or drag the bytes file into the TF_Files folder in the Unity Editor’s Project Window.
  • (Optional) Rename the bytes file “wallpongT1.bytes”.

The bytes file is a TensorFlow graph file of the trained model. Now that we have the model graph and weights imported into Unity, we can add the model to our WallPongBrain so the agent can play the game using the state-action policy specified by the model.

  • Select the WallPongBrain object in the Hierarchy Window and set Brain Type to Internal.
  • Drag the model bytes file (e.g., wallpongT1.bytes) into the Graph Model input field in the Brain (script) component in the Inspector panel (as shown in the figure below).
  • Save the scene (File -> Save Scene).
  • Save the project (File -> Save Project).
  • Press the Play button at the top of the Unity Editor and watch your agent play.
  • If everything went according to plan, your Paddle Agent should never miss a ball.

What’s Next?

We hope that by completing this getting started tutorial you have developed a basic understanding of how to use the ML-Agents toolkit. Of course, we have only scratched the surface of what is possible with the Unity ML-Agents toolkit. Moving forward we recommend the following:

  • Review the other simple game environments included in the AiUMLA Getting Started Package (i.e., Catch Ball and Pong).
  • Review the example environments provided in the ML-Agents toolkit.
  • Learn how to view and review the training process using TensorBoard [Using-Tensorboard].
  • Modify the WallPong (Catch or Pong) Agents and Brains to use a continuous action space.
  • Modify the WallPong (Catch or Pong) Agents and Brains to learn using pixel data.
  • Review current and future posts on the Adventures in Unity ML-Agents (AiUMLA) blog for more adventures, advanced tutorials, tips and tricks.
  • And of course… start creating your own ML-Agents Unity environment or game!
  • Let us know what you create and we might add a post about it.