diff --git a/README.md b/README.md
index bbd5bdcf..b0317468 100644
--- a/README.md
+++ b/README.md
@@ -174,7 +174,9 @@ Unity ML-Agents (Machine Learning Agents) is an open-source Unity plugin that al


-Trainng with ML-Agents is described [here](https://github.com/gzrjzcx/ML-agents/blob/master/docs/Training-ML-Agents.md#training-with-mlagents-learn)
+
+Training with ML-Agents is further described [here](https://github.com/gzrjzcx/ML-agents/blob/master/docs/Training-ML-Agents.md#training-with-mlagents-learn)
+

### PackerHand Agent
Each time a box needs to be packed from the spawning area into the bin, the agent:
@@ -213,11 +215,19 @@ Our policy is online and thus we want our agent to backpropagate and update its

For more explanation of the various parameters in the .yaml file, see also the [training config file](https://github.com/gzrjzcx/ML-agents/blob/master/docs/Training-ML-Agents.md#training-ml-agents) section.

-#### Policy
-Implement transformers (decision transformer / set transformer) in a multi-agent environment
-#### Observations
+
+### Proximal Policy Optimization (PPO) policy
+ML-Agents provides an implementation of two reinforcement learning algorithms:
+
+- Proximal Policy Optimization (PPO)
+- Soft Actor-Critic (SAC)
+
+The default algorithm is PPO, a method that has been shown to be more general-purpose and stable than many other RL algorithms.
+
+In contrast with PPO, SAC is off-policy, which means it can learn from experiences collected at any time in the past. PPO is on-policy, which means it learns directly from the real-time experiences collected from the environment; since our policy is trained online, this is a main reason PPO is used for PackerHand. For more information on PPO, see [here](https://openai.com/research/openai-baselines-ppo).
+
+### Observations
Observations are the information our agent gets from the environment.
-#### Actions
+### Actions
The Action space is the set of all possible actions in an environment. The actions of our agent come from a discrete action space. Every time the agent is called to make a decision, the following 3 actions are decided simultaneously:
1. The available positions vector:
@@ -234,15 +244,15 @@ The Action space is the set of all possible actions in an environment. The actio

Masking of actions is also implemented, preventing boxes that are already packed from being available in the action selection.
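+
+As a rough, hypothetical sketch (not the repository's actual code), the discrete branches and the masking of already-packed boxes can be wired into a Unity ML-Agents `Agent` roughly as follows; the branch layout (box, position, rotation) and the field names are assumptions, and the API names match recent ML-Agents releases:
+
+```csharp
+using Unity.MLAgents;
+using Unity.MLAgents.Actuators;
+
+// Illustrative sketch: three discrete branches decided simultaneously,
+// with already-packed boxes masked out of the box-selection branch.
+public class PackerHandAgentSketch : Agent
+{
+    bool[] isBoxPacked = new bool[0];   // set to true once a box has been placed in the bin
+
+    public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
+    {
+        // Branch 0 is assumed to select the box; disable boxes that are already packed.
+        for (int box = 0; box < isBoxPacked.Length; box++)
+        {
+            if (isBoxPacked[box])
+                actionMask.SetActionEnabled(0, box, false);
+        }
+    }
+
+    public override void OnActionReceived(ActionBuffers actions)
+    {
+        // All three branches arrive together in a single decision.
+        int selectedBox      = actions.DiscreteActions[0];
+        int selectedPosition = actions.DiscreteActions[1];  // index into the available positions vector
+        int selectedRotation = actions.DiscreteActions[2];  // assumed rotation branch
+        // ... place the selected box at the selected position/rotation and reward the result.
+    }
+}
+```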

-#### Rewards
+### Rewards
Shape/Tune reward towards a more sparse behavior

-#### Attention mechanism
+### Attention mechanism

-#### Memory-enhancement using RNN
+### Memory-enhancement using RNN
##### [](https://github.com/gzrjzcx/ML-agents/blob/master/docs/Feature-Memory.md)
Deciding what the agents should remember in order to solve a task is not easy to do by hand, but our training algorithm can learn to keep track of what is important to remember with an LSTM. To use the LSTM, training "remembers" a sequence of experiences instead of single experiences. The downside is that the training of the agents slows down.

-#### Curriculum learning
+### Curriculum learning
Curriculum learning is a way of training a machine learning model where more difficult aspects of a problem are gradually introduced in such a way that the model is always optimally challenged. This idea has been around for a long time, and it is how we humans typically learn. If you imagine any childhood primary school education, there is an ordering of classes and topics. Arithmetic is taught before algebra, for example. Likewise, algebra is taught before calculus. The skills and knowledge learned in the earlier subjects provide a scaffolding for later lessons.
The same principle can be applied to machine learning, where training on easier tasks can provide a scaffolding for harder tasks in the future. The [Wall Jump](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#wall-jump) example shows a simple implementation of Curriculum Learning with Unity ML-Agents.
@@ -251,12 +261,12 @@ For PackerHand we tested our agent with curriculum learning. For curriculum to b

The Curriculum Learning lessons are configured in the [.yaml file](https://github.com/bryanat/Reinforcement-Learning-Unity-3D-Packing/tree/master/Assets/ML-Agents/packerhand/Models).

-#### Multi-platform training
+### Multi-platform training

With multi-platform training, we found PPO performs better overall, with more consistency, better convergence, and improved stability and speed, using 1-2 platforms per CPU core with added GPU power. Having parallel environments also gives us the capability to set up different box sets on different platforms for greater data variability.
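+
+The curriculum lessons (and per-platform variations such as different box sets) are typically surfaced to the C# environment as environment parameters that are read when a platform resets. A minimal, hypothetical sketch (the parameter name `num_boxes` and its default value are illustrative, not taken from the PackerHand config):
+
+```csharp
+using Unity.MLAgents;
+
+// Illustrative sketch: read the current lesson's value at episode reset.
+public class BoxSpawnerSketch
+{
+    public int GetBoxCountForLesson()
+    {
+        // The curriculum in the .yaml config advances this parameter as training progresses.
+        float numBoxes = Academy.Instance.EnvironmentParameters.GetWithDefault("num_boxes", 5f);
+        return (int)numBoxes;
+    }
+}
+```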
-

@@ -264,10 +274,14 @@ Multi-platform - With multi-platform, we found PPO performs better overall with

-## Tensorboard
+## Viewing results with Tensorboard
+The ML-Agents Toolkit saves statistics during the learning session that you can view with a TensorFlow utility named TensorBoard. Check [here](https://unity-technologies.github.io/ml-agents/Using-Tensorboard/) to see how to visualize the training results.
+
+The figure below is a TensorBoard dashboard showing the results of one of the very first successful PackerHand trainings. As the number of episodes progresses, the agent collects monotonically more reward, which means that the agent is continuously learning how to pack boxes better.
+
![](VSCode/docs/images/runidfffffffff.png)

-## Training workflow
+## Training loop in Unity ML-Agents

The workflow is the following:
1. State S0
@@ -295,7 +309,7 @@ The workflow is the following:

- More than 15000 steps have been utilized per episode (negative reward)
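+
+As a loose sketch of how an episode-termination rule like the step limit above is commonly expressed in a Unity ML-Agents agent (the class name is illustrative, and while the step limit and penalty mirror the description above, the exact values used by PackerHand may differ):
+
+```csharp
+using Unity.MLAgents;
+using Unity.MLAgents.Actuators;
+
+// Illustrative sketch: end the episode with a negative reward once the step budget is exceeded.
+public class EpisodeLimitSketch : Agent
+{
+    const int MaxStepsPerEpisode = 15000;
+
+    public override void OnActionReceived(ActionBuffers actions)
+    {
+        // ... normal handling of the selected box/position/rotation ...
+
+        if (StepCount > MaxStepsPerEpisode)
+        {
+            AddReward(-1f);   // penalty value illustrative
+            EndEpisode();     // reset the platform and start a new episode
+        }
+    }
+}
+```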
-

diff --git a/VSCode/docs/images/multiplatform.png b/VSCode/docs/images/multiplatform.png
new file mode 100644
index 00000000..4892de7e
Binary files /dev/null and b/VSCode/docs/images/multiplatform.png differ
diff --git a/VSCode/docs/images/rl_model.png b/VSCode/docs/images/rl_model.png
new file mode 100644
index 00000000..077ec5a4
Binary files /dev/null and b/VSCode/docs/images/rl_model.png differ