Reinforcement Learning with AWS DeepRacer (AWS ML Scholarship)

 

Reinforcement learning is used in a variety of fields to solve real-world problems. It’s particularly useful for addressing sequential problems with long-term goals. Let’s take a look at some examples.

  • RL is great at playing games:
    • Go (board game) was mastered by the AlphaGo Zero software.
    • Atari classic video games are commonly used as a learning tool for creating and testing RL software.
    • StarCraft II, the real-time strategy video game, was mastered by the AlphaStar software.
  • RL is used in video game level design:
    • Video game level design determines how complex each stage of a game is and directly affects how boring, frustrating, or fun it is to play that game.
    • Video game companies create an agent that plays the game over and over again to collect data that can be visualized on graphs.
    • This visual data gives designers a quick way to assess how easy or difficult it is for a player to make progress, which enables them to find that “just right” balance between boredom and frustration faster.
  • RL is used in wind energy optimization:
    • RL models can also be used to power robotics in physical devices.
    • When multiple turbines work together in a wind farm, the turbines in the front, which receive the wind first, can cause poor wind conditions for the turbines behind them. This is called wake turbulence and it reduces the amount of energy that is captured and converted into electrical power.
    • Wind energy organizations around the world use reinforcement learning to test solutions. Their models respond to changing wind conditions by changing the angle of the turbine blades. When the upstream turbines slow down, the downstream turbines can capture more energy.
  • Other examples of real-world RL include:
    • Industrial robotics
    • Fraud detection
    • Stock trading
    • Autonomous driving

New Terms

  • Agent: The piece of software you are training is called an agent. It makes decisions in an environment to reach a goal.
  • Environment: The environment is the surrounding area with which the agent interacts.
  • Reward: Feedback is given to an agent for each action it takes in a given state. This feedback is a numerical reward.
  • Action: For every state, an agent needs to take an action toward achieving its goal.

Agent

  • The piece of software you are training is called an agent.
  • It makes decisions in an environment to reach a goal.
  • In AWS DeepRacer, the agent is the AWS DeepRacer car and its goal is to finish laps around the track as fast as it can while, in some cases, avoiding obstacles.

Environment

  • The environment is the surrounding area with which our agent interacts.
  • For AWS DeepRacer, this is a track in our simulator or in real life.

State

  • The state is defined by the current position within the environment that is visible, or known, to an agent.
  • In AWS DeepRacer’s case, each state is an image captured by its camera.
  • The car’s initial state is the starting line of the track and its terminal state is when the car finishes a lap, bumps into an obstacle, or drives off the track.

Action

  • For every state, an agent needs to take an action toward achieving its goal.
  • An AWS DeepRacer car approaching a turn can choose to accelerate or brake, and to turn left, turn right, or go straight (see the sketch of a discrete action space below).
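
To make the idea of an action concrete, here is a minimal sketch of a discrete action space as a list of steering/speed combinations, in the spirit of what the AWS DeepRacer console lets you configure. The specific angles and speeds below are illustrative assumptions, not the console defaults.

    # Illustrative sketch of a discrete action space: each action pairs a
    # steering angle (degrees, positive = left) with a speed (m/s).
    # The values are assumptions for illustration, not console defaults.
    ACTION_SPACE = [
        {"steering_angle": -30.0, "speed": 1.0},  # sharp right, slow
        {"steering_angle": -15.0, "speed": 2.0},  # gentle right
        {"steering_angle": 0.0, "speed": 3.0},    # straight, fast
        {"steering_angle": 15.0, "speed": 2.0},   # gentle left
        {"steering_angle": 30.0, "speed": 1.0},   # sharp left, slow
    ]

    # At each state the agent picks one of these actions toward its goal,
    # for example by index:
    chosen = ACTION_SPACE[2]
    print(chosen)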

Reward

  • Feedback is given to an agent for each action it takes in a given state.
  • This feedback is a numerical reward.
  • A reward function is an incentive plan that assigns scores as rewards to different zones on the track.
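
To make the reward-zone idea concrete, here is a minimal sketch of a DeepRacer-style reward function that scores the car by how close it stays to the centerline, modeled on the console's "follow the center line" sample. The input keys used below (track_width, distance_from_center) are assumed from the documented params dictionary; the zone widths and scores are arbitrary choices.

    def reward_function(params):
        """Minimal sketch: higher reward for staying near the centerline."""
        track_width = params["track_width"]                     # assumed input key
        distance_from_center = params["distance_from_center"]   # assumed input key

        # Three zones around the centerline, scored from best to worst.
        if distance_from_center <= 0.1 * track_width:
            reward = 1.0      # hugging the centerline
        elif distance_from_center <= 0.25 * track_width:
            reward = 0.5      # drifting, but acceptable
        elif distance_from_center <= 0.5 * track_width:
            reward = 0.1      # near the edge of the track
        else:
            reward = 1e-3     # likely off track

        return float(reward)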

Episode

  • An episode represents a period of trial and error when an agent makes decisions and gets feedback from its environment.
  • For AWS DeepRacer, an episode begins at the initial state, when the car leaves the starting position, and ends at the terminal state, when it finishes a lap, bumps into an obstacle, or drives off the track.

In a reinforcement learning model, an agent learns in an interactive real-time environment by trial and error using feedback from its own actions. Feedback is given in the form of rewards.
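
The snippet below is a toy, self-contained sketch of that trial-and-error cycle; it is not DeepRacer's actual training code (which uses the PPO or SAC algorithms discussed next). A tabular Q-learning agent learns to drive down a tiny one-dimensional "track": each episode runs from the start position to a terminal state, and every action returns a numerical reward that updates the agent's estimates.

    import random

    # Toy sketch of trial-and-error learning: a Q-learning agent on a 1-D
    # "track" of 6 positions. Reaching the last position ends the episode
    # with a reward of +1; every other step costs a small penalty.
    N_STATES = 6
    ACTIONS = [-1, +1]                      # move left or right
    q_table = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def step(state, action):
        next_state = max(0, min(N_STATES - 1, state + action))
        done = next_state == N_STATES - 1
        reward = 1.0 if done else -0.01     # numerical feedback for each action
        return next_state, reward, done

    def choose_action(state, epsilon=0.1):
        if random.random() < epsilon:       # explore occasionally
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q_table[(state, a)])  # otherwise exploit

    for episode in range(200):              # each episode is one trial-and-error run
        state, done = 0, False
        while not done:
            action = choose_action(state)
            next_state, reward, done = step(state, action)
            best_next = max(q_table[(next_state, a)] for a in ACTIONS)
            # Update the estimate for this (state, action) pair from the feedback.
            q_table[(state, action)] += 0.1 * (reward + 0.9 * best_next
                                               - q_table[(state, action)])
            state = next_state

    print("Learned move at the start position:",
          max(ACTIONS, key=lambda a: q_table[(0, a)]))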


Different training algorithms have different strategies for this trial-and-error learning.

  • A soft actor critic (SAC) embraces exploration and is data-efficient, but can lack stability.
  • A proximal policy optimization (PPO) is stable but data-hungry.

While your model trains, you can monitor its progress on the reward graph. Its parts are described below, and a short sketch after the list shows how the per-iteration averages are computed.

  • Average reward
    • This graph represents the average reward the agent earns during a training iteration. The average is calculated by averaging the reward earned across all episodes in the training iteration. An episode begins at the starting line and ends when the agent completes one loop around the track or at the place the vehicle left the track or collided with an object. Toggle the switch to hide this data.
  • Average percentage completion (training)
    • The training graph represents the average percentage of the track completed by the agent in all training episodes in the current training. It shows the performance of the vehicle while experience is being gathered.
  • Average percentage completion (evaluation)
    • While the model is being updated, the performance of the existing model is evaluated. The evaluation graph line is the average percentage of the track completed by the agent in all episodes run during the evaluation period.
  • Best model line
    • This line allows you to see which of your model iterations had the highest average progress during the evaluation. The checkpoint for this iteration will be stored. A checkpoint is a snapshot of a model that is captured after each training (policy-updating) iteration.
  • Reward primary y-axis
    • This shows the reward earned during a training iteration. To read the exact value of a reward, hover your mouse over the data point on the graph.
  • Percentage track completion secondary y-axis
    • This shows you the percentage of the track the agent completed during a training iteration.
  • Iteration x-axis
    • This shows the number of iterations completed during your training job.
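
As a rough sketch of how the averages plotted on the reward graph relate to individual episodes, the snippet below groups hypothetical episode results by training iteration and averages them. The record fields (iteration, reward, percent_complete) are assumptions for illustration, not the actual DeepRacer log schema.

    from collections import defaultdict

    # Hypothetical episode records (assumed format, not the real log schema).
    episodes = [
        {"iteration": 1, "reward": 12.0, "percent_complete": 18.0},
        {"iteration": 1, "reward": 20.5, "percent_complete": 31.0},
        {"iteration": 2, "reward": 35.0, "percent_complete": 55.0},
        {"iteration": 2, "reward": 41.2, "percent_complete": 62.0},
    ]

    by_iteration = defaultdict(list)
    for ep in episodes:
        by_iteration[ep["iteration"]].append(ep)

    for it, eps in sorted(by_iteration.items()):
        avg_reward = sum(e["reward"] for e in eps) / len(eps)              # "Average reward"
        avg_done = sum(e["percent_complete"] for e in eps) / len(eps)      # "Average percentage completion"
        print(f"iteration {it}: avg reward {avg_reward:.1f}, avg completion {avg_done:.1f}%")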



Reward Graph Interpretation

The following four examples give you a sense of how to interpret the success of your model based on the reward graph. Learning to read these graphs is as much of an art as it is a science and takes time, but reviewing the following four examples will give you a start.

Needs more training

In the following example, we see there have only been 600 iterations, and the graphs are still going up. We see the evaluation completion percentage has just reached 100%, which is a good sign but isn’t fully consistent yet, and the training completion graph still has a ways to go. This reward function and model are showing promise, but need more training time.


Graph of model that needs more training


No improvement

In the next example, we can see that the percentage of track completion hasn't gone above roughly 15 percent, and the model has been training for quite some time, probably around 6,000 iterations. This is not a good sign! Consider throwing this model and reward function away and trying a different strategy.

The reward graph of a model that is not worth keeping.


A well-trained model

In the following example graph, we see the evaluation percentage completion reached 100% a while ago, and the training percentage completion reached 100% roughly 100 iterations ago. At this point, the model is well trained. Training it further might lead to the model becoming overfit to this track.


Avoid overfitting

Overfitting or overtraining is a really important concept in machine learning. With AWS DeepRacer, this can become an issue when a model is trained on a specific track for too long. A good model should be able to make decisions based on the features of the road, such as the sidelines and centerlines, and be able to drive on just about any track.

An overtrained model, on the other hand, learns to navigate using landmarks specific to an individual track. For example, the agent turns a certain direction when it sees uniquely shaped grass in the background or a specific angle the corner of the wall makes. The resulting model will run beautifully on that specific track, but perform badly on a different virtual track, or even on the same track in a physical environment due to slight variations in angles, textures, and lighting.


This model had been overfit to a specific track.


Adjust hyperparameters

The AWS DeepRacer console's default hyperparameters are quite effective, but occasionally you may want to adjust them. Hyperparameters are variables that act as settings for the training algorithm and control how your agent learns during training. We learned, for example, that the learning rate controls how much new experience is counted in learning at each step.

In this reward graph example, the training completion graph and the reward graph are swinging high and low. This might suggest an inability to converge, which may be helped by adjusting the learning rate. Imagine if the current weight for a given node is .03, and the optimal weight should be .035, but your learning rate was set to .01. The next training iteration would then swing past optimal to .04, and the following iteration would swing under it to .03 again. If you suspect this, you can reduce the learning rate to .001. A lower learning rate makes learning take longer but can help increase the quality of your model.
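
The following toy sketch reproduces that arithmetic with plain gradient descent on a quadratic loss whose minimum sits at the "optimal" weight of .035. The loss curvature is chosen so the numbers match the example above; this is an illustration, not DeepRacer's actual optimizer.

    # Toy illustration of learning-rate overshoot: gradient descent on
    # loss(w) = 100 * (w - optimal)**2, starting from the current weight 0.03.
    def train(learning_rate, steps=8, w=0.03, optimal=0.035):
        history = [w]
        for _ in range(steps):
            grad = 200.0 * (w - optimal)     # gradient of the quadratic loss
            w -= learning_rate * grad
            history.append(round(w, 4))
        return history

    print("lr = 0.01 :", train(0.01))    # overshoots and swings between 0.03 and 0.04
    print("lr = 0.001:", train(0.001))   # smaller steps, slower but steady convergence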

This model's hyperparameters need to be adjusted.


Good Job and Good Luck!

Remember: training experience helps both the model and the reinforcement learning practitioner become a better team. Enter your model in the monthly AWS DeepRacer League races for chances to win prizes and glory while improving your machine learning development skills!

 
