StrikerVsGoalie and other self-play env improvements #3699
Conversation
Co-Authored-By: Vincent-Pierre BERGES <[email protected]>
This reverts commit 1d3e1cf.
```csharp
else
{
    // Existential penalty cumulant for Generic
    timePenalty += -1f / 3000f;
```
In this way, I accumulate the timestep penalty and give it all at once at the end of the episode if an agent wins. The point is that there's never additional utility in "losing faster", which there would be if each agent got the timestep penalty irrespective of outcome while still getting utility for winning faster. The trained SoccerTwos agents look much more stable defensively, and I attribute that to this.
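To illustrate the incentive argument above, here is a minimal Python sketch (not the Unity C# code; function names and the ±1 win/loss reward are assumptions for illustration) comparing a per-step penalty applied unconditionally against the cumulant scheme described in this PR, where the accumulated penalty is only applied on a win:

```python
# Hypothetical sketch: two time-penalty schemes over an episode of `steps`
# timesteps, with per-step penalty 1/3000 and a +1/-1 win/loss reward.
PENALTY_PER_STEP = 1.0 / 3000.0

def per_step_penalty_return(steps, won):
    """Scheme A: every agent pays the time penalty each step regardless of
    outcome. A losing agent's return improves the sooner it loses, so there
    is utility in 'losing faster'."""
    return (1.0 if won else -1.0) - steps * PENALTY_PER_STEP

def cumulant_return(steps, won):
    """Scheme B (this PR): the penalty is accumulated and only applied to a
    winner, so a loss is worth -1 no matter how long the agent held out."""
    return (1.0 - steps * PENALTY_PER_STEP) if won else -1.0

# Scheme A rewards losing at step 100 over losing at step 1000:
assert per_step_penalty_return(100, won=False) > per_step_penalty_return(1000, won=False)
# Scheme B makes all losses equally bad, removing that incentive:
assert cumulant_return(100, won=False) == cumulant_return(1000, won=False)
# Both schemes still reward winning faster:
assert cumulant_return(100, won=True) > cumulant_return(1000, won=True)
```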
ervteng
left a comment
Looks good
vincentpierre
left a comment
I have not yet loaded the scene to see for myself but I left some initial feedback. I will take another pass later.
Project/Assets/ML-Agents/Examples/Soccer/Scripts/AgentSoccer.cs (outdated; resolved)
```csharp
else
{
    // Existential penalty cumulant for Generic
    timePenalty += -1f / 3000f;
```
You only accumulate this time penalty for the "generic" position. Is that intended, or were you trying to give it to the strikers?
That was intended. The strikers deserve it every timestep.
```csharp
if (position == Position.Goalie)
{
    // Existential bonus for Goalies.
    AddReward(1f / 3000f);
```
Make this a const rather than copying 1f / 3000f in these three places.
Also, is this number dependent on the agent's max_step?
I added a private float that's set in Initialize using maxStep.
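The idea of deriving the per-step existential reward from the episode length can be sketched as follows (a hypothetical Python stand-in, not the actual C# field or Initialize code; the function and variable names are assumptions):

```python
# Illustrative sketch: derive the per-step existential reward from maxStep,
# so the reward over a full episode sums to exactly 1 regardless of the
# configured episode length (mirroring "a private float set in Initialize").
def existential_reward_per_step(max_step):
    return 1.0 / max_step

m_existential = existential_reward_per_step(3000)  # with maxStep = 3000

# With maxStep = 3000 this reproduces the hard-coded 1f / 3000f constant:
assert abs(m_existential - 1.0 / 3000.0) < 1e-15
# A full-length episode accumulates a total existential reward of 1:
assert abs(3000 * m_existential - 1.0) < 1e-12
```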
```csharp
}
if (c.gameObject.CompareTag("ball"))
{
    // Generic gets curriculum
```
What does this comment mean?
That only the generic agent gets curriculum. I can remove the if statement because the default value of m_BallTouch will be zero since it's set using curriculum.
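The point about the default value making the guard redundant can be illustrated with a small Python stand-in (hypothetical names mirroring the C# m_BallTouch field; the reward shape here is an assumption for illustration):

```python
# Illustrative sketch: if m_BallTouch defaults to 0 and is only raised via
# curriculum, the ball-touch reward term is already a no-op for agents that
# don't receive curriculum, so the guarding if-statement can be removed.
def ball_touch_reward(m_ball_touch, touched_ball):
    return m_ball_touch if touched_ball else 0.0

# Default (no curriculum): touching the ball adds nothing.
assert ball_touch_reward(0.0, touched_ball=True) == 0.0
# Curriculum-set value: touches are rewarded without any extra guard.
assert ball_touch_reward(0.2, touched_ball=True) == 0.2
```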
```markdown
  the ball from entering own goal.
* Agents: The environment contains four agents, with the same
  Behavior Parameters: SoccerTwos.
* Agent Reward Function (dependent):
```
What does dependent mean?
The agents share the reward function.
Proposed change(s)
This PR adds the StrikersVsGoalie env as well as some refactors to Tennis and Soccer
Tennis:
- Removed the net causing a loss of a point on the serve. I think the agents were developing weird biases and that this was partially the culprit behind the boring game play.
- Added a timestep penalty. The agents are now much more aggressive.

(Reserving the tennis changes for a separate PR.)
Soccer
Finally, I tweaked the hyperparameters for all these envs to be a bit more stable in training.
TODO:
- Add Soccer and Tennis brains (currently retraining with curriculum/no net, but training with the original envs yielded good results).
- Add to docs for new StrikerVsGoalie env.
- Add to self-play docs regarding curriculum recommendation in specifying rewards.

Useful links (GitHub issues, JIRA tickets, ML-Agents forum threads etc.)