How does an AI, or in our case a robot (or, as we call it in the field, an agent), learn to interact with the world? There are multiple ways of teaching an AI to find the optimal course of action in a given state. The three main ones are Supervised Learning, Unsupervised Learning and Reinforcement Learning. In this blog post, I will outline reinforcement learning from a bird's-eye view through the example of the AWS Deepracer.
What is the AWS Deepracer?
The Deepracer is a 1/18th scale autonomous car released by Amazon. You can find more information about it here, but in essence it is a robot car that can perform simple tasks such as turning, accelerating and braking, and that sees the world through a camera. It is a tool aimed at developers and students who want to learn about Reinforcement Learning.
We at COMPUTD first came across this device at the 2019 Big Data Expo in Utrecht, and were fascinated by it. Seeing it, we wanted to try and beat the board! I will detail our approach to the training a bit later in the post.
So what is Reinforcement Learning?
Reinforcement Learning is a discipline of Machine Learning where an agent is rewarded (or punished) for the actions it takes and for the environment that action creates. An agent has a set of actions it can perform at any time and an interpreter evaluates the action the agent took in a given state and assigns a reward. The agent wants to maximize the amount of reward it gathers by the end of the task or alternatively, wants to minimize the punishment it receives for not dealing with the task as the interpreter wanted.
The above picture summarizes this process quite clearly. The agent performs an action that influences the environment (or the agent's state), from which the interpreter can decide what reward it should give the agent. The interpreter also provides the agent with the state it is in, if necessary, or lets the agent make another observation about the environment. At the start of training the agent performs essentially random actions, and it gets reset once it has messed up badly enough, finished the task, or run out of time. Each reset starts a new episode, in which the agent keeps everything it learned in earlier episodes. Its actions therefore become less and less random as it tries to maximize its rewards.
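The loop described above can be sketched in a few lines of Python. This is a toy illustration, not how the Deepracer itself is trained: the `LineWorld` environment, the `train` function, the reward values and the tabular Q-learning update are all assumptions picked to keep the example small. The agent starts at position 0 and has to learn to walk forward to a goal.

```python
import random


class LineWorld:
    """Hypothetical toy environment: walk from position 0 to `goal`."""

    def __init__(self, goal=5, max_steps=50):
        self.goal = goal
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # Start a new episode at the start line.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # action is -1 (step back) or +1 (step forward).
        self.position = max(0, self.position + action)
        self.steps += 1
        reached_goal = self.position == self.goal
        done = reached_goal or self.steps >= self.max_steps
        # The "interpreter": +1 for reaching the goal, a small penalty otherwise.
        reward = 1.0 if reached_goal else -0.1
        return self.position, reward, done


def train(episodes=200, epsilon=0.2, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    env = LineWorld()
    actions = [-1, 1]
    q = {}  # (state, action) -> estimated long-term reward
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise act on what was learned.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q.get((state, a), 0.0))
            next_state, reward, done = env.step(action)
            # Q-learning update: nudge the estimate towards reward + discounted future value.
            best_next = max(q.get((next_state, a), 0.0) for a in actions)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
    return q


if __name__ == "__main__":
    learned = train()
    for s in range(5):
        print(s, "forward" if learned.get((s, 1), 0.0) > learned.get((s, -1), 0.0) else "back")
```

The table `q` is what carries over from one episode to the next. It is the agent's memory of which actions paid off in which states.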
Why not just let the interpreter perform the actions? It knows what the best course of action is in a state, right?
Not necessarily. The interpreter might be able to evaluate a given state of the environment but does not necessarily know how to get to the next best state. The interpreter works with what’s called a Reward Function: a programming function that looks at the current state and returns a number – the reward – back to the agent. In some cases, this function is robust and takes into account every possible state the environment and the agent can be in, but in some cases, it’s as simple as asking whether the agent is closer to the goal or not.
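As a minimal sketch of the simplest case mentioned above, a reward function can just compare how far the agent was from the goal before and after its action. The function name and its two parameters are hypothetical and only serve the illustration.

```python
def closer_to_goal_reward(previous_distance, current_distance):
    """Return +1.0 if the agent moved closer to the goal, otherwise -1.0."""
    return 1.0 if current_distance < previous_distance else -1.0
```

Note that this function can judge a move after the fact, but it says nothing about which move to make next. Working that out is left to the agent.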
How does this work in practice?
The AWS Deepracer has a simulator that comes with a couple of tracks and a comprehensive guide that details states and actions. This simulator can be used to train the car, so that we don’t risk damaging it and we don’t have to constantly reset it to the start line.
We can set the car's maximum steering angle, the steering angle granularity (how many intervals we want to split the turns up into), its maximum speed, and the speed granularity.
These splits define our action space: every combination of steering angle and speed the car can choose from, or (in this case) the states the car can get into after performing a steering and accelerating move. Taking action one, for example, results in the car steering at -30 degrees (in the Deepracer's sign convention a negative angle steers to the right) and accelerating or decelerating to 0.67 m/s.
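To make the splits concrete, here is a rough sketch of how such an action space could be enumerated. The helper `build_action_space` is hypothetical, and the values of 30 degrees, 5 steering intervals, 2.0 m/s and 3 speed intervals are assumptions chosen so that the first action matches the -30 degrees and 0.67 m/s mentioned above. The ordering used by the actual Deepracer console may differ.

```python
def build_action_space(max_steering_angle, steering_granularity,
                       max_speed, speed_granularity):
    """Enumerate every (steering angle, speed) combination."""
    if steering_granularity == 1:
        angles = [0.0]
    else:
        # Evenly spaced angles from -max to +max, both ends included.
        step = 2 * max_steering_angle / (steering_granularity - 1)
        angles = [-max_steering_angle + i * step
                  for i in range(steering_granularity)]
    # Evenly spaced speeds up to and including max_speed, rounded to 2 decimals.
    speeds = [round(max_speed * (j + 1) / speed_granularity, 2)
              for j in range(speed_granularity)]
    return [{"steering_angle": a, "speed": s} for a in angles for s in speeds]


if __name__ == "__main__":
    for index, action in enumerate(build_action_space(30, 5, 2.0, 3), start=1):
        print(index, action)
```

With these assumed settings the car has 5 x 3 = 15 discrete actions to pick from at every step.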
There is also certain information that we can get from the interpreter about the status of the car and the environment. These can be directly about the car, such as speed or heading, or the car’s location on the track, such as progress or distance from the centre of the track. We can also see more complex information, such as the next waypoints the car can reach, or whether all four wheels are on the track.
These can be treated as inputs to the reward function so that the car can be rewarded or punished for the actions it takes during training. When designing the reward function, one can take into account any number of these inputs, and potentially get the car to drive around the track. For example, you can combine progress and speed, and reward the car when it makes progress quickly. After a sufficient amount of training, this should get the car around the track pretty fast, and it is a very minimalistic approach to designing a reward function.
In our attempt, we decided to simulate the racing line that real human racers would take through the track. My colleague Dirk Bongers has quite some experience with simulated racing, so he designed an optimal line around the track. We used the waypoints that the environment contains, and rewarded the agent when it was on the correct side of the track at each waypoint, and we rewarded it more if it was also at the optimal speed. A visualization of how the waypoints are located on the track can be found here.
This reward function penalizes the car if it is on the wrong side of the track, and also punishes it if it does not drive at the optimal speed. In green zones, where the car should be driving at maximum speed, the slower it goes the bigger the penalty it receives. In red zones, where it should be driving slowly to take a corner, the penalty grows the faster it goes.
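A minimal sketch of such a progress-and-speed reward is shown below. It follows the Deepracer convention of a `reward_function(params)` that returns a float, and reads the `all_wheels_on_track`, `progress`, `steps` and `speed` inputs. The exact formula (progress per step multiplied by speed) is my own assumption for illustration, not a tuned or official recipe.

```python
def reward_function(params):
    """Reward the car for making progress quickly (illustrative sketch)."""
    # Give almost no reward when the car has left the track.
    if not params["all_wheels_on_track"]:
        return 1e-3

    progress = params["progress"]   # percentage of the track completed
    steps = max(params["steps"], 1)  # steps taken so far; avoid dividing by zero
    speed = params["speed"]         # current speed in m/s

    # Progress per step is high when the lap is covered in few steps;
    # multiplying by speed adds an extra push to drive fast.
    return float(progress / steps * speed)
```

For example, a car that is halfway around the track after 100 steps and driving at 2 m/s would receive 50 / 100 * 2 = 1.0 under this sketch.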
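The idea can be sketched roughly as follows. This is not our actual reward function: the `RACING_LINE` table, the speed limits and the penalty weights are made-up placeholders, and a real table would need one entry per waypoint of the track. The `closest_waypoints`, `is_left_of_center` and `speed` inputs are the ones provided by the Deepracer environment.

```python
MAX_SPEED = 2.0   # assumed top speed in m/s
MIN_SPEED = 0.67  # assumed slowest speed in m/s

# Hypothetical racing line: waypoint index -> (preferred side, speed zone).
RACING_LINE = {
    0: ("left", "green"),
    1: ("left", "green"),
    2: ("right", "red"),
    3: ("right", "red"),
}


def reward_function(params):
    """Reward staying on the racing line at the right speed (sketch)."""
    next_waypoint = params["closest_waypoints"][1]
    # Waypoints missing from the table default to a fast left-hand section.
    side, zone = RACING_LINE.get(next_waypoint, ("left", "green"))

    reward = 1.0

    # Penalize being on the wrong side of the centre line.
    if (side == "left") != params["is_left_of_center"]:
        reward -= 0.5

    # Penalize driving too slowly in green zones and too fast in red zones.
    span = MAX_SPEED - MIN_SPEED
    if zone == "green":
        penalty = (MAX_SPEED - params["speed"]) / span
    else:
        penalty = (params["speed"] - MIN_SPEED) / span
    reward -= 0.5 * min(max(penalty, 0.0), 1.0)

    # Keep the reward strictly positive.
    return max(reward, 1e-3)
```

Scaling the speed penalty with the distance from the target speed gives the agent a smooth signal to follow, rather than a single threshold to cross.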
I hope this gave you some insight into how Reinforcement Learning works, and how you can use it on the AWS Deepracer. In upcoming articles we will discuss Supervised and Unsupervised Learning as well.
Oh, and how did the trained car perform at the Big Data Expo? Well, we managed to get a sub-16-second time, and the top spot on the board: