Asymmetric self-play #3653
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,21 +1,40 @@ | ||||||
| # Training with Self-Play | ||||||
|
|
||||||
| ML-Agents provides the functionality to train symmetric, adversarial games with [Self-Play](https://openai.com/blog/competitive-self-play/). | ||||||
| A symmetric game is one in which opposing agents are *equal* in form and function. In reinforcement learning, | ||||||
| this means both agents have the same observation and action spaces. | ||||||
| With self-play, an agent learns in adversarial games by competing against fixed, past versions of itself | ||||||
| to provide a more stable, stationary learning environment. This is compared | ||||||
| to competing against its current self in every episode, which is a constantly changing opponent. | ||||||
| ML-Agents provides the functionality to train both symmetric and asymmetric adversarial games with | ||||||
| [Self-Play](https://openai.com/blog/competitive-self-play/). | ||||||
| A symmetric game is one in which opposing agents are equal in form, function and objective. Examples of symmetric games |
|
||||||
| are Tennis and Soccer. In reinforcement learning, this means both agents have the same observation and | ||||||
|
||||||
| action spaces and learn from the same reward function and so *they can share the same policy*. In asymmetric games, | ||||||
| this is not the case. Examples of asymmetric games are Hide and Seek or Strikers vs Goalie in Soccer. Agents in these | ||||||
| types of games do not always have the same observation or action spaces and so sharing policy networks is not | ||||||
| necessarily ideal. Fortunately, both of these situations are supported with only a few extra command line | ||||||
| arguments and trainer configurations! | ||||||
|
|
||||||
| With self-play, an agent learns in adversarial games by competing against fixed, past versions of its opponent | ||||||
| (which could be itself as in symmetric games) to provide a more stable, stationary learning environment. This is compared | ||||||
| to competing against the current, best opponent in every episode, which is constantly changing (because it's learning). | ||||||
|
|
||||||
| Self-play can be used with our implementations of both [Proximal Policy Optimization (PPO)](Training-PPO.md) and [Soft Actor-Critic (SAC)](Training-SAC.md). | ||||||
| However, from the perspective of an individual agent, these scenarios appear to have non-stationary dynamics because the opponent is often changing. | ||||||
| This can cause significant issues in the experience replay mechanism used by SAC. Thus, we recommend that users use PPO. For further reading on | ||||||
| this issue in particular, [see this paper](https://arxiv.org/pdf/1702.08887.pdf). | ||||||
|
||||||
| For more general information on training with ML-Agents, see [Training ML-Agents](Training-ML-Agents.md). | ||||||
| For more algorithm-specific instructions, please see the documentation for [PPO](Training-PPO.md) or [SAC](Training-SAC.md). | ||||||
|
|
||||||
| Self-play is triggered by including the self-play hyperparameter hierarchy in the trainer configuration file. A detailed description of the self-play hyperparameters is provided below. Furthermore, to distinguish opposing agents, set the team ID to different integer values in the behavior parameters script on the agent prefab. | ||||||
|
|
||||||
|  | ||||||
|
|
||||||
| See the trainer configuration and agent prefabs for our Tennis environment for an example. | ||||||
| ***Team ID must be 0 or an integer greater than 0. Negative numbers will cause unpredictable behavior.*** | ||||||
|
||||||
|
|
||||||
| In symmetric games, since all agents (even on opposing teams) will share the same policy, they should have the same 'Behavior Name' in their | ||||||
| Behavior Parameters Script. In asymmetric games, they should have a different Behavior Name in their Behavior Parameters script. | ||||||
| Note, in asymmetric games, the agents must have both different Behavior Names *and* different team IDs! Then, specify the trainer configuration | ||||||
|
Contributor: So this means you can't have … If it's a removable restriction, don't let it block this PR, but can you log a jira for followup?

Contributor (Author): This will be something we support when we introduce a true multiagent trainer, i.e. multiple behavior names that are on the same team.
||||||
| for each Behavior Name in your scene as you would normally, and remember to include the self-play hyperparameter hierarchy! | ||||||
|
|
||||||
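As a rough sketch of what this looks like in practice, an asymmetric setup could use one trainer configuration entry per Behavior Name, each carrying its own `self_play` section. The behavior names and values below are illustrative placeholders, not the exact settings shipped with the example environments; check the bundled trainer configuration files for the real ones.

```
Striker:    # hypothetical Behavior Name for the two-agent team
  trainer: ppo
  self_play:
    save_steps: 50000
    swap_steps: 100000
    play_against_current_best_ratio: 0.5

Goalie:     # hypothetical Behavior Name for the one-agent team
  trainer: ppo
  self_play:
    save_steps: 50000
    swap_steps: 25000
    play_against_current_best_ratio: 0.5
```

The two different `swap_steps` values here follow the team-size formula worked through in the Swap Steps section below.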
| For examples of how to use this feature, you can see the trainer configurations and agent prefabs for our Tennis, Soccer and Strikers Vs Goalie environments. | ||||||
| Tennis and Soccer provide examples of symmetric games whereas Strikers Vs Goalie provides an example of an asymmetric game. | ||||||
|
|
||||||
|
|
||||||
| ## Best Practices Training with Self-Play | ||||||
|
|
||||||
|
|
@@ -24,7 +43,8 @@ issues faced by reinforcement learning. In general, the tradeoff is between | |||||
| the skill level and generality of the final policy and the stability of learning. | ||||||
| Training against a set of slowly or unchanging adversaries with low diversity | ||||||
| results in a more stable learning process than training against a set of quickly | ||||||
| changing adversaries with high diversity. With this context, this guide discusses the exposed self-play hyperparameters and intuitions for tuning them. | ||||||
| changing adversaries with high diversity. With this context, this guide discusses | ||||||
| the exposed self-play hyperparameters and intuitions for tuning them. | ||||||
|
|
||||||
|
|
||||||
| ## Hyperparameters | ||||||
|
|
@@ -37,31 +57,68 @@ The ELO calculation (discussed below) depends on this final reward being either | |||||
|
|
||||||
| The reward signal should still be used as described in the documentation for the other trainers and [reward signals.](Reward-Signals.md) However, we encourage users to be a bit more conservative when shaping reward functions due to the instability and non-stationarity of learning in adversarial games. Specifically, we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing) and to allow for more iterations of training to compensate for the sparsity of reward. | ||||||
|
|
||||||
| ### Team Change | ||||||
|
|
||||||
| The `team-change` ***command line argument*** corresponds to the number of *trainer_steps* between switching the learning team. So, | ||||||
| if you run with the command line flag `--team-change=200000`, the learning team will change every `200000` trainer steps. This ensures each team trains | ||||||
| for precisely the same number of steps. Note, this is not specified in the trainer configuration yaml file, but as a command line argument. | ||||||
|
|
||||||
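For example, assuming the standard `mlagents-learn` entry point, a run with this flag might be launched as follows (the config path and run id below are placeholders):

```
mlagents-learn config/trainer_config.yaml --run-id=SelfPlayRun --team-change=200000
```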
| A larger value of `team-change` will allow the agent to train longer against its opponents. The longer an agent trains against the same set of opponents | ||||||
| the more able it will be to defeat them. However, training against them for too long may result in overfitting to the particular opponent strategies | ||||||
| and so the agent may fail against the next batch of opponents. | ||||||
|
|
||||||
| The value of `team-change` will determine how many snapshots of the agent's policy are saved to be used as opponents for the other team. So, we | ||||||
| recommend setting this value as a function of the `save_steps` parameter which is discussed in the next section. | ||||||
|
|
||||||
| Recommended Range : 4x-10x where x=`save_steps` | ||||||
|
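For example, with `save_steps=20000` this recommendation suggests a `--team-change` value between `80000` and `200000` trainer steps.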
Contributor: Instead of specifying the value here, would it be easier to specify it as a multiple of …?

Contributor (Author): I don't have strong feelings either way. If you think that's more intuitive then that works for me.

Contributor: Not strong feelings, I just think it makes it easier to twist one knob at a time, instead of having to twist 2 in unison.
||||||
|
|
||||||
| ### Save Steps | ||||||
|
|
||||||
| The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps`=10000 then a snapshot of the current policy will be saved every 10000 trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. | ||||||
| The `save_steps` parameter corresponds to the number of *trainer steps* between snapshots. For example, if `save_steps=10000` then a snapshot of the current policy will be saved every `10000` trainer steps. Note, trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. | ||||||
|
|
||||||
| A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training. As a result, the agent trains against a wider variety of opponents. Learning a policy to defeat more diverse opponents is a harder problem and so may require more overall training steps but also may lead to more general and robust policy at the end of training. This value is also dependent on how intrinsically difficult the environment is for the agent. | ||||||
|
|
||||||
| Recommended Range : 10000-100000 | ||||||
|
|
||||||
| ### Swap Steps | ||||||
|
|
||||||
| The `swap_steps` parameter corresponds to the number of *trainer steps* between swapping the opponents policy with a different snapshot. As in the `save_steps` discussion, note that trainer steps are counted per agent. For more information, please see the [migration doc](Migrating.md) after v0.13. | ||||||
|
|
||||||
| The `swap_steps` parameter corresponds to the number of *ghost steps* between swapping the opponent's policy with a different snapshot. | ||||||
| This occurs when the team of this agent is not learning. A 'ghost step' refers | ||||||
| to a step taken by an agent *that is following a fixed policy* i.e. is not the learning agent. The reason for this distinction is that in asymmetric games, | ||||||
| we may have teams with an unequal number of agents e.g. the 2v1 scenario in our Strikers Vs Goalie environment. The team with two agents collects | ||||||
| twice as many agent steps per environment step as the team with one agent. Thus, these two values will need to be distinct to ensure that the same number | ||||||
| of trainer steps corresponds to the same number of opponent swaps for each team. The formula for `swap_steps` if | ||||||
| a user desires `x` swaps of a team with `num_agents` agents against an opponent team with `num_opponent_agents` | ||||||
| agents during `team-change` total steps is: | ||||||
|
|
||||||
| ``` | ||||||
| swap_steps = (num_agents / num_opponent_agents) * (team_change / x) | ||||||
|
Contributor: Should we be doing the math in the code? I think math is hard...

Contributor (Author): We don't know how many agents are on each team.

Contributor: Don't we know the number of steps coming in for each team though? Also, would this need to be different if the agents and opponents were running at different decision intervals?
||||||
| ``` | ||||||
|
|
||||||
| As an example, in our Strikers Vs Goalie environment, if we want the swap to occur `x=4` times during `team-change=200000` steps, | ||||||
| the `swap_steps` for the team of one agent is: | ||||||
|
|
||||||
| ``` | ||||||
| swap_steps = (1 / 2) * (200000 / 4) = 25000 | ||||||
| ``` | ||||||
| The `swap_steps` for the team of two agents is: | ||||||
| ``` | ||||||
| swap_steps = (2 / 1) * (200000 / 4) = 100000 | ||||||
| ``` | ||||||
| Note, with equal team sizes, the first term is equal to 1 and `swap_steps` can be calculated by just dividing the total steps by the desired number of swaps. | ||||||
|
|
||||||
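As a quick sanity check of the formula, here is a small, hypothetical helper (not part of ML-Agents) that reproduces the Strikers Vs Goalie numbers above:

```
def compute_swap_steps(
    num_agents: int, num_opponent_agents: int, team_change: int, desired_swaps: int
) -> int:
    """swap_steps giving `desired_swaps` opponent swaps over `team_change` trainer steps."""
    return int((num_agents / num_opponent_agents) * (team_change / desired_swaps))


# 2v1 Strikers Vs Goalie example with team-change=200000 and x=4 swaps:
print(compute_swap_steps(1, 2, 200000, 4))  # 25000  (team of one agent)
print(compute_swap_steps(2, 1, 200000, 4))  # 100000 (team of two agents)
```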
| A larger value of `swap_steps` means that an agent will play against the same fixed opponent for a larger number of training iterations. This results in a more stable training scenario, but leaves the agent open to the risk of overfitting its behavior to this particular opponent. Thus, when a new opponent is swapped in, the agent may lose more often than expected. | ||||||
|
|
||||||
| Recommended Range : 10000-100000 | ||||||
|
|
||||||
| ### Play against current self ratio | ||||||
| ### Play against current best ratio | ||||||
|
|
||||||
| The `play_against_current_self_ratio` parameter corresponds to the probability | ||||||
| an agent will play against its ***current*** self. With probability | ||||||
| 1 - `play_against_current_self_ratio`, the agent will play against a snapshot of itself | ||||||
| from a past iteration. | ||||||
| The `play_against_current_best_ratio` parameter corresponds to the probability | ||||||
| an agent will play against the current opponent. With probability | ||||||
| 1 - `play_against_current_best_ratio`, the agent will play against a snapshot of its | ||||||
|
||||||
Suggested change (formatting): write `1 - play_against_current_best_ratio` with the whole expression in code style, rather than 1 - `play_against_current_best_ratio`.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,36 +1,46 @@ | ||
| from typing import Dict, NamedTuple | ||
| from typing import NamedTuple | ||
| from urllib.parse import urlparse, parse_qs | ||
|
|
||
|
|
||
| class BehaviorIdentifiers(NamedTuple): | ||
|
||
| name_behavior_id: str | ||
| """ | ||
| BehaviorIdentifiers is a named tuple of the identifiers that uniquely distinguish |
|
||
| an agent encountered in the trainer_controller. The named tuple consists of the | ||
| fully qualified behavior name, the brain name (which corresponds to a trainer | ||
| in the trainer controller) and the team id. In the future, this can be extended | ||
| to support further identifiers. | ||
| """ | ||
|
|
||
| behavior_id: str | ||
| brain_name: str | ||
|
||
| behavior_ids: Dict[str, int] | ||
| team_id: int | ||
|
|
||
| @staticmethod | ||
| def from_name_behavior_id(name_behavior_id: str) -> "BehaviorIdentifiers": | ||
| """ | ||
| Parses a name_behavior_id of the form name?team=0&param1=i&... | ||
| Parses a name_behavior_id of the form name?team=0 | ||
| into a BehaviorIdentifiers NamedTuple. | ||
| This allows you to access the brain name and distinguishing identifiers | ||
| without parsing more than once. | ||
| This allows you to access the brain name and team id of an agent |
|
||
| :param name_behavior_id: String of behavior params in HTTP format. | ||
| :returns: A BehaviorIdentifiers object. | ||
| """ | ||
|
|
||
| ids: Dict[str, int] = {} | ||
| if "?" in name_behavior_id: | ||
| name, identifiers = name_behavior_id.rsplit("?", 1) | ||
| if "&" in identifiers: | ||
| list_of_identifiers = identifiers.split("&") | ||
| else: | ||
| list_of_identifiers = [identifiers] | ||
|
|
||
| for identifier in list_of_identifiers: | ||
| key, value = identifier.split("=") | ||
| ids[key] = int(value) | ||
| else: | ||
| name = name_behavior_id | ||
|
|
||
| parsed = urlparse(name_behavior_id) | ||
| name = parsed.path | ||
| ids = parse_qs(parsed.query) | ||
| team_id: int = 0 | ||
| if "team" in ids: | ||
| team_id = int(ids["team"][0]) | ||
| return BehaviorIdentifiers( | ||
| name_behavior_id=name_behavior_id, brain_name=name, behavior_ids=ids | ||
| behavior_id=name_behavior_id, brain_name=name, team_id=team_id | ||
| ) | ||
|
|
||
|
|
||
| def create_name_behavior_id(name: str, team_id: int) -> str: | ||
|
Contributor: Is this used anywhere? Would it be better as a method (or property) of BehaviorIdentifiers?

Contributor (Author): It's used here and here. In both instances, it's used so that the correct policies are pushed onto the correct queues if the learning team changes right before a swap. I'm not sure it's appropriate to be a method/property of BehaviorIdentifiers because it's not really operating on data contained in a BehaviorIdentifiers tuple.
||
| """ | ||
| Reconstructs fully qualified behavior name from name and team_id | ||
| :param name: brain name | ||
| :param team_id: team ID | ||
| :return: name_behavior_id | ||
| """ | ||
| return name + "?team=" + str(team_id) | ||
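A brief, hypothetical round-trip example for the helpers above; the import path is an assumption (it is not shown in the diff):

```
from mlagents.trainers.behavior_id_utils import (  # module path assumed
    BehaviorIdentifiers,
    create_name_behavior_id,
)

ids = BehaviorIdentifiers.from_name_behavior_id("Striker?team=1")
print(ids.brain_name, ids.team_id)                           # Striker 1
print(create_name_behavior_id(ids.brain_name, ids.team_id))  # Striker?team=1
```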
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| from typing import Deque, Dict | ||
| from collections import deque | ||
| from mlagents.trainers.ghost.trainer import GhostTrainer | ||
|
|
||
|
|
||
| class GhostController(object): | ||
|
||
| """ | ||
| GhostController contains a queue of team ids. GhostTrainers subscribe to the GhostController and query | ||
| it to get the current learning team. The GhostController cycles through team ids every 'swap_interval' | ||
| which corresponds to the number of trainer steps between changing learning teams. | ||
|
||
| """ | ||
|
|
||
| def __init__(self, swap_interval: int, maxlen: int = 10): | ||
| """ | ||
| Create a GhostController. | ||
| :param swap_interval: Number of trainer steps between changing learning teams. | ||
| :param maxlen: Maximum number of GhostTrainers allowed in this GhostController | ||
| """ | ||
|
|
||
| self._swap_interval = swap_interval | ||
| # Tracks last swap step for each learning team because trainer | ||
| # steps of all GhostTrainers do not increment together | ||
| self._last_swap: Dict[int, int] = {} | ||
| self._queue: Deque[int] = deque(maxlen=maxlen) | ||
| self._learning_team: int = -1 | ||
| # Dict from team id to GhostTrainer | ||
| self._ghost_trainers: Dict[int, GhostTrainer] = {} | ||
|
|
||
| def subscribe_team_id(self, team_id: int, trainer: GhostTrainer) -> None: | ||
| """ | ||
| Given a team_id and trainer, add to queue and trainers if not already. | ||
| The GhostTrainer is used later by the controller to get ELO ratings of agents. | ||
| :param team_id: The team_id of an agent managed by this GhostTrainer | ||
| :param trainer: A GhostTrainer that manages this team_id. | ||
| """ | ||
| if team_id not in self._ghost_trainers: | ||
| self._ghost_trainers[team_id] = trainer | ||
| self._last_swap[team_id] = 0 | ||
| if self._learning_team < 0: | ||
| self._learning_team = team_id | ||
| else: | ||
| self._queue.append(team_id) | ||
|
|
||
| def get_learning_team(self, step: int) -> int: | ||
| """ | ||
| Returns the current learning team. If 'swap_interval' steps have elapsed, the current | ||
| learning team is added to the end of the queue and then updated with the next in line. | ||
| :param step: Current step of the trainer. | ||
| :return: The learning team id | ||
| """ | ||
| if step >= self._swap_interval + self._last_swap[self._learning_team]: | ||
| self._last_swap[self._learning_team] = step | ||
| self._queue.append(self._learning_team) | ||
| self._learning_team = self._queue.popleft() | ||
| return self._learning_team | ||
|
|
||
| # Adapted from https://github.com/Unity-Technologies/ml-agents/pull/1975 and | ||
| # https://metinmediamath.wordpress.com/2013/11/27/how-to-calculate-the-elo-rating-including-example/ | ||
| # ELO calculation | ||
| # TODO : Generalize this to more than two teams | ||
| def compute_elo_rating_changes(self, rating: float, result: float) -> float: | ||
| """ | ||
| Calculates ELO. Given the rating of the learning team and result. The GhostController | ||
| queries the other GhostTrainers for the ELO of their agent that is currently being deployed. | ||
| Note, this could be the current agent or a past snapshot. | ||
| :param rating: Rating of the learning team. | ||
| :param result: Win, loss, or draw from the perspective of the learning team. | ||
| :return: The change in ELO. | ||
| """ | ||
| opponent_rating: float = 0.0 | ||
| for team_id, trainer in self._ghost_trainers.items(): | ||
| if team_id != self._learning_team: | ||
| opponent_rating = trainer.get_opponent_elo() | ||
| r1 = pow(10, rating / 400) | ||
| r2 = pow(10, opponent_rating / 400) | ||
|
|
||
| summed = r1 + r2 | ||
| e1 = r1 / summed | ||
|
|
||
| change = result - e1 | ||
| for team_id, trainer in self._ghost_trainers.items(): | ||
| if team_id != self._learning_team: | ||
| trainer.change_opponent_elo(change) | ||
|
|
||
| return change | ||
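A minimal usage sketch of the team cycling above; the import path is assumed, and real callers pass GhostTrainer instances (plain stand-in objects suffice here because `subscribe_team_id` only stores them):

```
from mlagents.trainers.ghost.controller import GhostController  # import path assumed

controller = GhostController(swap_interval=1000)
controller.subscribe_team_id(0, trainer=object())  # first subscriber becomes the learning team
controller.subscribe_team_id(1, trainer=object())  # later subscribers wait in the queue

print(controller.get_learning_team(step=500))   # 0 -- swap_interval not yet reached
print(controller.get_learning_team(step=1000))  # 1 -- learning team cycles to the queued team
```

On the ELO update itself: with equal ratings the expected score `e1` is 0.5, so a win (`result = 1.0`) yields a change of +0.5 for the learning team.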