</p>
<br>

Training with ML-Agents is further described [here](https://github.com/gzrjzcx/ML-agents/blob/master/docs/Training-ML-Agents.md#training-with-mlagents-learn).


### PackerHand Agent
Each time a box needs to be packed from the spawning area into the bin, the agent:
For more explanation of the various parameters in the .yaml file, see also
the [training config file](https://github.com/gzrjzcx/ML-agents/blob/master/docs/Training-ML-Agents.md#training-ml-agents) section.


### Proximal Policy Optimization (PPO) policy
ML-Agents provides implementations of two reinforcement learning algorithms:

- Proximal Policy Optimization (PPO)
- Soft Actor-Critic (SAC)

The default algorithm is PPO, a method that has been shown to be more general-purpose and stable than many other RL algorithms.

PPO is an on-policy algorithm, which means it learns directly from the real-time experiences it collects from the environment. In contrast, SAC is off-policy, which means it can learn from experiences collected at any time in the past. Because our policy is trained online, PPO is the algorithm used for PackerHand. For more information on PPO, see [here](https://openai.com/research/openai-baselines-ppo).

### Observations
Observations are the information our agent gets from the environment.
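
As a rough illustration, observations in Unity ML-Agents are collected by overriding `Agent.CollectObservations`. The sketch below uses hypothetical fields (bin size, fill ratio, box sizes) purely for illustration; the actual PackerHand observation set lives in the agent script in this repository.

```csharp
// Hypothetical sketch of agent observations for a packing agent.
// Field names are illustrative, not the repository's actual code.
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class PackerAgentObservations : Agent
{
    public Vector3 binSize;       // dimensions of the bin
    public Transform[] boxes;     // boxes waiting in the spawning area
    float binFillRatio;           // fraction of bin volume already packed

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(binSize);        // 3 floats
        sensor.AddObservation(binFillRatio);   // 1 float

        // One entry per box: its size and whether it is already packed.
        foreach (var box in boxes)
        {
            sensor.AddObservation(box.localScale);                       // 3 floats
            sensor.AddObservation(box.gameObject.activeSelf ? 0f : 1f);  // packed flag
        }
    }
}
```
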
### Actions
The action space is the set of all possible actions in an environment. Our agent's actions are discrete. Every time the agent is called to make a decision, the following three actions are decided simultaneously:

1. The available positions vector:

Action masking is also implemented, preventing boxes that have already been packed from being available in the action selection.
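
A minimal sketch of how a three-branch discrete action space and the action masking can look on the C# side is shown below. The branch meanings and sizes beyond the positions vector are assumptions for illustration, not the exact PackerHand implementation.

```csharp
// Hedged sketch of three simultaneous discrete action branches and action masking.
// Branch meanings (position / rotation / box index) are assumptions for illustration.
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class PackerAgentActions : Agent
{
    public bool[] boxAlreadyPacked;   // one flag per selectable box

    public override void OnActionReceived(ActionBuffers actions)
    {
        // All three branches are decided simultaneously in a single decision.
        int positionIndex = actions.DiscreteActions[0]; // where to place the box
        int rotationIndex = actions.DiscreteActions[1]; // which orientation to use
        int boxIndex      = actions.DiscreteActions[2]; // which box to pack next

        PlaceBox(boxIndex, positionIndex, rotationIndex);
    }

    public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
    {
        // Mask boxes that are already packed so they cannot be selected again.
        const int boxBranch = 2;
        for (int i = 0; i < boxAlreadyPacked.Length; i++)
        {
            if (boxAlreadyPacked[i])
                actionMask.SetActionEnabled(boxBranch, i, false);
        }
    }

    void PlaceBox(int box, int position, int rotation) { /* environment-specific */ }
}
```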

### Rewards
Shape/tune the reward towards a more sparse behavior.
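
As a hedged sketch of this idea, a small dense reward per packed box can be combined with a larger terminal reward when the bin is finished. The values and method names below are illustrative only, not the repository's exact reward function.

```csharp
// Illustrative reward shaping for a packing agent (values are assumptions).
using Unity.MLAgents;

public class PackerAgentRewards : Agent
{
    // Dense shaping: small reward proportional to the volume packed this step.
    public void OnBoxPacked(float boxVolume, float binVolume)
    {
        AddReward(boxVolume / binVolume);
    }

    // Sparse objective: larger terminal reward based on the final fill ratio.
    public void OnAllBoxesPacked(float fillRatio)
    {
        SetReward(fillRatio * 10f);
        EndEpisode();
    }
}
```
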
### Attention mechanism
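
ML-Agents supports attention over a variable number of entities through its `BufferSensor`. Whether PackerHand feeds the boxes through a `BufferSensorComponent` is an assumption; the sketch below only illustrates how such a sensor is typically used.

```csharp
// Sketch of attention over a variable number of boxes using ML-Agents' BufferSensor,
// which the trainer processes with an attention module. The use of a BufferSensor and
// the observable layout here are assumptions for illustration.
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class PackerAgentAttention : Agent
{
    public BufferSensorComponent boxSensor;  // e.g. MaxNumObservables = max boxes, ObservableSize = 4
    public Transform[] boxes;

    public override void CollectObservations(VectorSensor sensor)
    {
        foreach (var box in boxes)
        {
            var size = box.localScale;
            // One variable-length entry per box: its dimensions and volume.
            boxSensor.AppendObservation(new float[]
            {
                size.x, size.y, size.z, size.x * size.y * size.z
            });
        }
    }
}
```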

### Memory-enhancement using RNN
##### [Memory-enhanced agents using RNN](https://github.com/gzrjzcx/ML-agents/blob/master/docs/Feature-Memory.md)
Deciding what the agents should remember in order to solve a task is not easy to do by hand, but our training algorithm can learn to keep track of what is important to remember with an LSTM. When using the LSTM, the agent is trained on sequences of experiences instead of single experiences. The downside is that training the agents slows down.

### Curriculum learning
Curriculum learning is a way of training a machine learning model where more difficult aspects of a problem are gradually introduced in such a way that the model is always optimally challenged. This idea has been around for a long time, and it is how we humans typically learn. If you imagine any childhood primary school education, there is an ordering of classes and topics. Arithmetic is taught before algebra, for example. Likewise, algebra is taught before calculus. The skills and knowledge learned in the earlier subjects provide a scaffolding for later lessons. The same principle can be applied to machine learning, where training on easier tasks can provide a scaffolding for harder tasks in the future.

The [Wall Jump](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#wall-jump) example shows a simple implementation of Curriculum Learning with Unity ML-Agents.
For PackerHand we tested our agent with curriculum learning.

The Curriculum Learning lessons are configured in the [.yaml file](https://github.com/bryanat/Reinforcement-Learning-Unity-3D-Packing/tree/master/Assets/ML-Agents/packerhand/Models).
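
On the Unity side, a curriculum lesson typically reaches the environment as an environment parameter that the agent reads at the start of each episode. The parameter name `num_boxes` below is purely illustrative; the real lesson parameters are the ones defined in the .yaml file linked above.

```csharp
// Sketch of consuming a curriculum-controlled environment parameter in the agent.
// "num_boxes" and its default value are assumptions for illustration.
using Unity.MLAgents;

public class PackerAgentCurriculum : Agent
{
    public override void OnEpisodeBegin()
    {
        // Reads the value set by the current curriculum lesson (default 5 if not set).
        float numBoxes = Academy.Instance.EnvironmentParameters.GetWithDefault("num_boxes", 5f);
        SpawnBoxes((int)numBoxes);
    }

    void SpawnBoxes(int count) { /* environment-specific */ }
}
```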

### Multi-platform training

With multi-platform training, we found that PPO performs better overall, with more consistency, better convergence, and improved stability and speed, using 1-2 platforms per CPU core with added GPU power. Having parallel environments also gives us the capability to set up different box sets on different platforms for greater data variability.

<br>
<p align = "center" draggable=”false” ><img src="VSCode/docs/images/drl-unity-api-io-sensor-actuator.png"
<p align = "center" draggable=”false” ><img src="VSCode/docs/images/multiplatform.png"
width="400px"
height="auto"/>
</p>
<br>

<!-- ![](VSCode/docs/images/drl-unity-api-io-sensor-actuator.png) -->

## Viewing results with TensorBoard
The ML-Agents Toolkit saves statistics during learning sessions that you can view with a TensorFlow utility named TensorBoard. Check [here](https://unity-technologies.github.io/ml-agents/Using-Tensorboard/) for how to visualize training results.

The figure below is a TensorBoard dashboard showing the results of one of the very first successful PackerHand trainings. As the number of episodes increases, the agent collects monotonically more reward, which means the agent is continuously learning how to pack boxes better.

![](VSCode/docs/images/runidfffffffff.png)

## Training loop in Unity ML-Agents

The workflow is the following:
1. State S0
- More than 15000 steps have been utilized per episode (negative reward)

<br>
<p align = "center" draggable=”false” ><img src="VSCode/docs/images/scribble-policy-state-action-reward.png"
<p align = "center" draggable=”false” ><img src="VSCode/docs/images/rl_model.png"
width="400px"
height="auto"/>
</p>
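
To connect this loop to the Unity ML-Agents API, the sketch below shows how the callbacks of an `Agent` subclass map onto the state-action-reward cycle and the termination conditions listed above. The step-budget check and reward values are illustrative assumptions rather than the exact PackerHand code.

```csharp
// Compact sketch of how the training loop maps onto ML-Agents Agent callbacks.
// The 15000-step budget check and reward values are illustrative assumptions.
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;

public class PackerAgentLoop : Agent
{
    int packedBoxes, totalBoxes;

    public override void OnEpisodeBegin()     // reset bin and respawn boxes (state S0)
    {
        packedBoxes = 0;
        /* reset environment */
    }

    public override void CollectObservations(VectorSensor sensor) { /* state St */ }

    public override void OnActionReceived(ActionBuffers actions)  // action At
    {
        bool packed = TryPackNextBox(actions);
        if (packed) { AddReward(0.1f); packedBoxes++; }            // reward Rt

        if (packedBoxes == totalBoxes)
        {
            SetReward(1f);        // all boxes packed: positive terminal reward
            EndEpisode();
        }
        else if (StepCount > 15000)
        {
            SetReward(-1f);       // step budget exceeded: negative reward
            EndEpisode();
        }
    }

    bool TryPackNextBox(ActionBuffers actions) { /* environment-specific */ return true; }
}
```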