Clips actions to large limits before applying them to the environment #984
Conversation
Slightly related issue: #673
Resolved review comments (outdated) on:
source/extensions/omni.isaac.lab/omni/isaac/lab/envs/direct_rl_env_cfg.py
source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env_cfg.py
Mayankm96 left a comment:
Should we also change the gym.spaces range to these bounds?
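For context, a minimal sketch of what aligning the action space with such bounds could look like (assuming gymnasium and a placeholder `num_actions`; this is illustrative, not the change proposed in this PR):

```python
import numpy as np
import gymnasium as gym

# Hypothetical: a Box action space whose limits match the clipping bounds.
num_actions = 12  # task-specific placeholder
action_space = gym.spaces.Box(low=-100.0, high=100.0, shape=(num_actions,), dtype=np.float32)
```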
In my opinion, RL libraries should take care of this, not Isaac Lab.
@Toni-SM I agree. We should move this to the environment wrappers (similar to what we do for RL-Games). Regarding the action/obs space design for the environments, I think it is better to handle that as its own separate change. The current fix in this MR is at least critical for continuous learning tasks: without it, users get NaNs from the simulation due to the policy feedback loop (a large action enters the observations, which then leads to even larger action predictions, which eventually drives the sim unstable). So I'd prefer that we don't block this fix itself.
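A minimal sketch of the wrapper-level approach being suggested here (hypothetical code, not the actual RL-Games wrapper; names and bounds are placeholders):

```python
import numpy as np
import gymnasium as gym


class ClipActionWrapper(gym.ActionWrapper):
    """Clip actions to fixed bounds before they reach the environment (illustrative only)."""

    def __init__(self, env: gym.Env, low: float = -100.0, high: float = 100.0):
        super().__init__(env)
        self._low = low
        self._high = high

    def action(self, action):
        # Bound the policy output so it cannot grow without limit.
        return np.clip(action, self._low, self._high)
```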
Reviewed diff in the env config (following the docstring line "Please refer to the :class:`omni.isaac.lab.utils.noise.NoiseModel` class for more details."):

    action_bounds: list[float] = [-100, 100]
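As a rough illustration of how such a config field could be consumed (a sketch under assumed names, not the actual Isaac Lab implementation):

```python
import torch

# Assumed: `action_bounds` holds [low, high] and `actions` is the raw policy output.
def clip_actions(actions: torch.Tensor, action_bounds: list[float]) -> torch.Tensor:
    low, high = action_bounds
    return torch.clamp(actions, min=low, max=high)

# e.g. inside the environment's step, before the actions are applied to the sim:
# actions = clip_actions(actions, self.cfg.action_bounds)
```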
Just curious, where does [-100, 100] come from? I wonder if it's best to leave this user-specified?
The ±100 limits come from our internal codebase, originally from legged gym.
I was considering having it None or inf by default, but then users need to consciously set this value, and I think most people who hit training stability issues will probably not think of that.
Could we set it to None and add an FAQ entry to the docs?
I think we can set it to (-inf, inf).
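A small sketch of what that default could look like, with clipping only taking effect when finite bounds are given (hypothetical names; with infinite bounds the clamp is effectively a no-op):

```python
import math
import torch

# Assumed default: no effective clipping unless the user narrows the bounds.
DEFAULT_ACTION_BOUNDS: tuple[float, float] = (-math.inf, math.inf)

def apply_action_bounds(actions: torch.Tensor, bounds: tuple[float, float]) -> torch.Tensor:
    low, high = bounds
    if math.isinf(low) and math.isinf(high):
        return actions  # default: pass the actions through unchanged
    return torch.clamp(actions, min=low, max=high)
```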
…_env_cfg.py Co-authored-by: Mayank Mittal <[email protected]> Signed-off-by: renezurbruegg <[email protected]>
…ased_env_cfg.py Co-authored-by: Mayank Mittal <[email protected]> Signed-off-by: renezurbruegg <[email protected]>
@renezurbruegg Would you be able to help move the changes to the wrappers?
This will introduce "arbitrary" bounds of [-100, 100] for any new user who merges this PR, which could lead to unexpected behaviour. How should this be addressed? In my opinion there are three options:
I personally prefer option (3).
Please note that the current implementation conflicts with #1117 for the direct workflow.
Can these changes then be integrated directly into #1117?
@renezurbruegg, as I commented previously, in my opinion the RL libraries should take care of this, not Isaac Lab. For example, using skrl you can set a model parameter to clip the actions. However, if the target library cannot take care of that, then option 3 that you mentioned (which will not prevent training from eventually throwing an exception), or clipping the actions directly in the task implementation for critical cases, could be a solution.
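For reference, a minimal sketch of the library-side approach described here, using skrl's GaussianMixin with action clipping enabled (illustrative; the network architecture is an arbitrary placeholder):

```python
import torch
import torch.nn as nn
from skrl.models.torch import Model, GaussianMixin


class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=True):
        Model.__init__(self, observation_space, action_space, device)
        # clip_actions=True makes skrl clip sampled actions to the action-space bounds.
        GaussianMixin.__init__(self, clip_actions=clip_actions)

        self.net = nn.Sequential(
            nn.Linear(self.num_observations, 64), nn.ELU(),
            nn.Linear(64, self.num_actions),
        )
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        # Return the action mean, log standard deviation, and an (empty) outputs dict.
        return self.net(inputs["states"]), self.log_std_parameter, {}
```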
This MR is important. However, I agree with Toni's remark that the RL library should handle it. I have made a new MR for the RL wrapper environment: #2019. Closing this MR as it has become stale.
(Description of the follow-up MR #2019:)

# Description

Currently, the actions from the policy are directly applied to the environment and are also often fed back to the policy via the last-action observation. This can lead to instability during training, since applying a large action can introduce a self-reinforcing feedback loop. More specifically, applying a very large action leads to large last_action observations, which often results in a large error in the critic, which can lead to even larger actions being sampled in the future. This PR fixes this for the RSL-RL library by clipping the actions to (large) hard limits before applying them to the environment. This prevents the actions from growing without bound and greatly improves training stability.

Fixes #984, #1732, #1999

## Type of change

- Bug fix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there
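A rough sketch of what wrapper-level clipping for RSL-RL could look like (hypothetical code, not the actual contents of #2019; bounds and names are placeholders):

```python
import torch


class ClippedActionVecEnvWrapper:
    """Illustrative vectorized-env wrapper that clips actions before stepping (not the real #2019 code)."""

    def __init__(self, env, clip_actions: float = 100.0):
        self.env = env
        self.clip_actions = clip_actions

    def step(self, actions: torch.Tensor):
        # Clip the raw policy output to +/- clip_actions before it reaches the environment.
        actions = torch.clamp(actions, -self.clip_actions, self.clip_actions)
        return self.env.step(actions)

    def __getattr__(self, name):
        # Forward everything else to the wrapped environment.
        return getattr(self.env, name)
```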
Description
Currently, the actions from the policy are directly applied to the environment and are also often fed back to the policy via the last-action observation.
Doing this can lead to instability during training, since applying a large action can introduce a self-reinforcing feedback loop.
More specifically, applying a very large action leads to large last_action observations, which often results in a large error in the critic, which can lead to even larger actions being sampled in the future.
This PR aims to fix this by clipping the actions to (large) hard limits before applying them to the environment. This prevents the actions from growing continuously and, in my case, greatly improves training stability.
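To make the feedback loop concrete, here is a toy numerical illustration (not Isaac Lab code): a fake "policy" whose output scales with the last-action observation diverges without clipping but stays bounded with it.

```python
import numpy as np

def rollout(clip=None, steps=20):
    last_action = 1.0
    for _ in range(steps):
        # Toy policy: the output grows with the last-action observation (gain > 1 models the unstable loop).
        action = 1.5 * last_action
        if clip is not None:
            action = np.clip(action, -clip, clip)
        last_action = action  # fed back as an observation at the next step
    return last_action

print(rollout(clip=None))   # diverges: 1.5**20 ~ 3.3e3 and keeps growing with more steps
print(rollout(clip=100.0))  # saturates at the bound: 100.0
```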
Type of change
TODO
Checklist
- I have run the `pre-commit` checks with `./isaaclab.sh --format`
- I have made corresponding changes to the documentation
- My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- I have added my name to the `CONTRIBUTORS.md` or my name already exists there