Retry on libssh SSH_AGAIN return code #756

Open
wants to merge 2 commits into
base: devel

Conversation

justin-stephenson
@justin-stephenson justin-stephenson commented Jul 30, 2025

SUMMARY

When a low SSH options timeout value is set, we sometimes see calls to new_channel(), and the underlying ssh_channel_open_session(), fail when libssh returns SSH_AGAIN. Currently, pylibssh raises an exception:


../../../../pytest-mh/pytest_mh/conn/ssh.py:285: in _run
    self.__channel = self.__conn.new_channel()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
src/pylibsshext/session.pyx:514: in pylibsshext.session.Session.new_channel
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]
src/pylibsshext/channel.pyx:71: LibsshChannelException

The SSH_AGAIN return code is documented at https://api.libssh.org/master/group__libssh__channel.html#gaf051dd30d75bf6dc45d1a5088cf970bd

The documentation does not state this clearly, but SSH_AGAIN can also be returned because of a timeout.

ssh_channel_open_session()

Returns
    SSH_OK on success, SSH_ERROR if an error occurred, SSH_AGAIN if in nonblocking mode and call has to be done again.
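
For reference, the `[-2]` in the exception above is the numeric value of SSH_AGAIN. The following is a minimal sketch of that mapping, using the values defined in libssh's public header; the helper name `describe_rc` is hypothetical and is not part of pylibssh:

```python
# Return codes as defined in libssh's public header (libssh.h).
SSH_OK = 0
SSH_ERROR = -1
SSH_AGAIN = -2
SSH_EOF = -127

_RC_NAMES = {
    SSH_OK: "SSH_OK",
    SSH_ERROR: "SSH_ERROR",
    SSH_AGAIN: "SSH_AGAIN",
    SSH_EOF: "SSH_EOF",
}


def describe_rc(rc: int) -> str:
    """Hypothetical helper: turn a libssh return code into a readable name."""
    return _RC_NAMES.get(rc, f"unknown ({rc})")
```

With this mapping, `describe_rc(-2)` yields `"SSH_AGAIN"`, which is what the `Failed to open_session: [-2]` message is reporting.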
ISSUE TYPE
  • Bugfix Pull Request
ADDITIONAL INFORMATION

This issue happens in our https://github.com/next-actions/pytest-mh project.

CC @pbrezina

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo yum install -y dnf-plugins-core on RHEL 8
  • sudo dnf install -y dnf-plugins-core on Fedora
  • dnf copr enable packit/ansible-pylibssh-756
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Jul 30, 2025
Comment on lines 2 to 3
``ssh_userauth_password`` are now retried when libssh returns SSH_AGAIN.
:user:`justin-stephenson`.
Member

Suggested change
- ``ssh_userauth_password`` are now retried when libssh returns SSH_AGAIN.
- :user:`justin-stephenson`.
+ ``ssh_userauth_password`` are now retried when ``libssh`` returns ``SSH_AGAIN``
+ -- by :user:`justin-stephenson`.

@@ -0,0 +1,3 @@
The ``Channel`` class calls to libssh ``ssh_channel_open_session`` and
Member

I'm pretty sure classes don't call things. They represent states. Could you rephrase?

Member

@webknjaz webknjaz left a comment

I wonder if this could result in an infinite loop..

Also, I don't think there's proof that this works as intended without tests. Add them.

@KB-perByte KB-perByte self-requested a review August 6, 2025 09:42
@Jakuje
Contributor

Jakuje commented Aug 6, 2025

I wonder if this could result in an infinite loop..

Yes, this would go into an infinite loop if the server dies and does not disconnect properly. Timeouts exist to handle exactly this situation.

Retrying unconditionally and indefinitely is OK for tests, but for real-world applications pylibssh should at the very least do some check with ssh_is_connected() or similar.

Or set some limit on how many times you can retry. But in that case, why not raise the timeout itself?
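
The two safeguards mentioned above (a connection check and a retry limit) could be combined roughly as follows. This is only a sketch under stated assumptions: `operation` and `is_connected` are hypothetical callables standing in for the libssh call and for ssh_is_connected(), and `RetryLimitExceeded` is an invented exception name, not an existing pylibssh error.

```python
SSH_AGAIN = -2  # value from libssh.h


class RetryLimitExceeded(Exception):
    """Hypothetical error: every attempt returned SSH_AGAIN."""


def call_with_bounded_retry(operation, is_connected, max_attempts=5):
    """Retry `operation` while it returns SSH_AGAIN, at most `max_attempts` times.

    `operation` is a zero-argument callable returning a libssh-style return
    code. `is_connected` is a zero-argument callable standing in for
    ssh_is_connected(). Both are assumptions made for this sketch.
    """
    for _ in range(max_attempts):
        rc = operation()
        if rc != SSH_AGAIN:
            # SSH_OK or SSH_ERROR: hand the result back to the caller.
            return rc
        if not is_connected():
            # No point retrying on a session that is already gone.
            raise ConnectionError("session is no longer connected")
    raise RetryLimitExceeded(f"still SSH_AGAIN after {max_attempts} attempts")
```

The attempt limit guards against the case where the server dies without disconnecting properly, in which a connection check alone might keep reporting the session as connected.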

@justin-stephenson
Author

Thank you for your comments and review.

I wonder if this could result in an infinite loop..

Also, I don't think there's proof that this works as intended without tests. Add them.

I tried to add a test for this but unfortunately I couldn't reproduce the scenario we see in the test environment. I'll try again.

I wonder if this could result in an infinite loop..

Yes, this would go into an infinite loop if the server dies and does not disconnect properly. Timeouts exist to handle exactly this situation.

Retrying unconditionally and indefinitely is OK for tests, but for real-world applications pylibssh should at the very least do some check with ssh_is_connected() or similar.

Or set some limit on how many times you can retry. But in that case, why not raise the timeout itself?

I can change this PR to add a call to ssh_is_connected() to avoid an infinite loop. Alternatively, I can raise a different exception when SSH_AGAIN is returned (such as LibsshChannelAgain), and we will handle that exception in our calls to pylibssh methods.

Whichever you prefer is acceptable for us; just let me know and I'll make those changes.
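
The second option, a dedicated exception handled by the caller, might look roughly like this. It is a sketch only: `LibsshChannelAgain` does not exist in pylibssh today, the `LibsshChannelException` class below is a local stand-in for `pylibsshext.errors.LibsshChannelException`, and `new_channel` is any zero-argument callable such as a bound `Session.new_channel`.

```python
class LibsshChannelException(Exception):
    """Local stand-in for pylibsshext.errors.LibsshChannelException."""


class LibsshChannelAgain(LibsshChannelException):
    """Hypothetical subclass raised when libssh returns SSH_AGAIN."""


def open_channel_with_retry(new_channel, max_attempts=3):
    """Call `new_channel` until it succeeds or the attempt budget runs out.

    Only the hypothetical LibsshChannelAgain is retried. Any other
    exception propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return new_channel()
        except LibsshChannelAgain as exc:
            last_error = exc
    # Every attempt asked to be retried: surface the last error.
    raise last_error
```

Making the new exception a subclass of the existing channel exception would keep current `except LibsshChannelException` handlers working unchanged, while letting callers that care distinguish the retryable case.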

@Jakuje
Contributor

Jakuje commented Aug 7, 2025

I am actually wondering how you are getting SSH_AGAIN in these two places with pylibssh. Sessions in libssh are blocking by default. The only way to switch a session to non-blocking mode is to call ssh_set_blocking() or to use some variation of ssh_channel_read_nonblocking(). Your changes are in a completely different place where this should not come into effect, and I do not see these functions exposed in pylibssh either.

But there might be an oddity where setting a low timeout actually returns SSH_AGAIN in places where, according to the documentation, it should not. That would be a bug in libssh that needs to be fixed.

What initially led you to set smaller timeouts? Is raising the timeouts a viable workaround?

@justin-stephenson
Author

I am actually wondering how you are getting SSH_AGAIN in these two places with pylibssh. Sessions in libssh are blocking by default. The only way to switch a session to non-blocking mode is to call ssh_set_blocking() or to use some variation of ssh_channel_read_nonblocking(). Your changes are in a completely different place where this should not come into effect, and I do not see these functions exposed in pylibssh either.

The error we currently see in our PRCI is specific to ssh_channel_open_session failure:

FAILED tests/test_authentication.py::test_authentication__user_login_with_overriding_home_directory[domain] (ldap) - pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

I added the session: commit only as a nice-to-have, because ssh_userauth_password() can return SSH_AGAIN according to the libssh API docs. The channel.pyx commit addresses the main issue we are currently hitting.

But there might be an oddity where setting a low timeout actually returns SSH_AGAIN in places where, according to the documentation, it should not. That would be a bug in libssh that needs to be fixed.

What initially led you to set smaller timeouts? Is raising the timeouts a viable workaround?

In our code we call .set_ssh_options("timeout", 1) because pytest-mh allows users to execute commands over SSH on hosts with an arbitrary timeout value, such as:

client.host.conn.run(..., timeout=X)

If I understand correctly, setting this low set_ssh_options("timeout") value is necessary for the above to work as expected, because Python will not deliver a signal while the code is blocked in a C library. The signal is delivered only after control returns to Python code. @pbrezina can correct me here.

-- for reference https://github.com/next-actions/pytest-mh/blob/master/pytest_mh/conn/ssh.py
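
The idea described above, short blocking slices so that Python regains control and can enforce a longer caller-chosen timeout, can be sketched as follows. This is not the pytest-mh implementation: `poll_once` is a hypothetical callable standing in for a libssh call limited by the short libssh timeout, assumed here to return `None` when its slice expires without a result.

```python
import time


def run_until_deadline(poll_once, total_timeout, clock=time.monotonic):
    """Call `poll_once` repeatedly until it yields a result or time runs out.

    Each `poll_once` call is assumed to block for at most the short libssh
    timeout. Between calls the interpreter is back in Python code, so
    pending signal handlers can run and the overall deadline is checked.
    """
    deadline = clock() + total_timeout
    while True:
        result = poll_once()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"no result within {total_timeout} seconds")
```

The `clock` parameter exists only so that the loop can be exercised with a fake clock in tests; by default it uses `time.monotonic`.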

justin-stephenson added a commit to justin-stephenson/sssd that referenced this pull request Aug 11, 2025
Workaround ansible pylibssh issue which causes test failures

   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

PR ansible/pylibssh#756 is under review
but workaround it in the meantime.
Setting a low SSH options timeout value can lead to

LibsshChannelException: Failed to open_session: [-2]

Attempt to retry when this occurs.
justin-stephenson added a commit to justin-stephenson/sssd that referenced this pull request Aug 12, 2025
Workaround ansible pylibssh issue which causes test failures

   pylibsshext.errors.LibsshChannelException: Failed to open_session: [-2]

PR ansible/pylibssh#756 is under review
but workaround it in the meantime.
@Jakuje
Contributor

Jakuje commented Aug 14, 2025

OK, the libssh timeout is the time you give libssh before it returns control to you. But if you are setting a low timeout to get signals delivered, then either pylibssh or the caller needs to retry. The pylibssh code is really not written to support retries here, so my proposal would be to create some pylibssh timeout/retry counter to avoid an infinite loop when things go wrong. What do you think?

It can either be a separate pylibssh option, or it can somehow be intercepted when we set the libssh timeout, setting it to some multiple of the user-specified value to return handling to the Python code. Or use the second option by default, with a possible override.

And obviously we need some tests for this option, otherwise it's untested, broken code. I bet we can get some slow CI runners where this would show up from time to time.
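
One possible reading of the proposal above is a retry budget derived from the libssh timeout by a multiplier, with an optional explicit override. The sketch below illustrates that reading only; the class name `RetryBudget`, its fields, and the default multiplier of 10 are all assumptions and do not correspond to any existing pylibssh option.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryBudget:
    """Hypothetical retry budget derived from the libssh timeout."""

    libssh_timeout: float             # seconds passed to libssh per call
    multiplier: int = 10              # assumed default, not a pylibssh value
    override: Optional[float] = None  # explicit overall limit in seconds

    def __post_init__(self):
        if self.libssh_timeout <= 0:
            raise ValueError("libssh_timeout must be positive")

    @property
    def total_timeout(self) -> float:
        """Overall time allowed before giving up on SSH_AGAIN retries."""
        if self.override is not None:
            return self.override
        return self.libssh_timeout * self.multiplier

    @property
    def max_attempts(self) -> int:
        """Number of libssh calls that fit into the overall budget."""
        return max(1, math.ceil(self.total_timeout / self.libssh_timeout))
```

Because the budget is plain arithmetic, it can be unit-tested without a slow server; the retry loop that consumes it would still need a test double that returns SSH_AGAIN a fixed number of times.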
