Skip to content

Conversation

stephenchengCloud
Copy link
Collaborator

This commit adds idle timeout feature for vnc console connections.

Key changes:

  • Add idle timeout detection by monitoring RFB keyEvent and pointerEvent.
  • Add callback function to proxy to parse the RFB messages and determine if the connection is idle or not.

This commit adds idle timeout feature for vnc console connections.

Key changes:
- Add idle timeout detection by monitoring RFB keyEvent and
  pointerEvent.
- Add callback function to `proxy` to parse the RFB messages and
  determine if the connection is idle or not.

Signed-off-by: Stephen Cheng <[email protected]>
@stephenchengCloud
Copy link
Collaborator Author

This is a re-open PR of #6667 based on the rfb parser.
Tested:
Set idletimeout = 60s

Sep 26 07:23:22 eu1-dt094 xapi: [debug||621 HTTPS 10.70.0.164->:::80|Connection to VM console R:b967a8c7f944|console] Connected; running proxy (between fds: 55 a        nd 56)
Sep 26 07:23:22 eu1-dt094 xenopsd-xc: [debug||7 ||device] Got QMP event, domain-20: VNC_CONNECTED
…

Sep 26 07:24:28 eu1-dt094 xapi: [debug||621 HTTPS 10.70.0.164->:::80|Connection to VM console R:b967a8c7f944|console] Idle timeout exceeded! (timeout: 60.0s)
Sep 26 07:24:28 eu1-dt094 xapi: [debug||621 HTTPS 10.70.0.164->:::80|Connection to VM console R:b967a8c7f944|console] Proxy exited
Sep 26 07:24:28 eu1-dt094 xenopsd-xc: [debug||7 ||device] Got QMP event, domain-20: VNC_DISCONNECTED

exception Close_proxy

let proxy ?(should_close : (bytes * int * int -> bool) option = None)
?(poll_timeout = -1) (a : Unix.file_descr) (b : Unix.file_descr) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter poll_timeout is optional. Should we not need special values to indicate that we have no timeout?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a consequence of the interface of polly, which expects a -1 for "no timeout".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, you can define should_close without option = None (as the ? makes it an option type already). Then you don't need to use Some when you call the function in console.ml.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This should be a parameter with a default but it should not be an optional value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proxy is not only used by console. I made poll_timeout as optional because I don't want to modify the code that calls proxy in other places before.

(* The timeout (in seconds) for event polling in the proxy loop.
If set to a positive value, the poll will wake up periodically,
which is useful for implementing features like idle timeout or periodic inspection of proxy buffers. *)
let poll_timeout_sec = ref 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be a float value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout value is for polly.wait_fold. It should be an int.
Check:
https://github.com/lindig/polly/blob/master/lib/polly.mli#L127

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected value at the Polly API is an int in milliseconds - so we could take a float in seconds and convert it to milliseconds when we pass it to Polly.

Copy link
Member

@psafont psafont Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an Mtime.Span.t, also the current name doesn't make it clear it's for a proxy

Suggested change
let poll_timeout_sec = ref 5
let proxy_poll_period_timeout = ref Mtime.Span.(5 * s)

)
; ( "poll_timeout_sec"
, Arg.Set_int poll_timeout_sec
, (fun () -> string_of_int !poll_timeout_sec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use a float.

exception Close_proxy

let proxy ?(should_close : (bytes * int * int -> bool) option = None)
?(poll_timeout = -1) (a : Unix.file_descr) (b : Unix.file_descr) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a consequence of the interface of polly, which expects a -1 for "no timeout".

Comment on lines 435 to 441
match should_close with
| Some cb ->
let buf, read_len, offset = CBuf.read b' b in
if cb (buf, read_len, offset) then
close_proxy ()
| None ->
ignore (CBuf.read b' b)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shorter version should work too:

let data = CBuf.read b' b in
Option.iter (fun cb ->
  if cb data then close_proxy ()
) should_close

Comment on lines 443 to 448
( match should_close with
| Some cb ->
if cb (Bytes.empty, 0, 0) then
close_proxy ()
| None ->
()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option.iter (fun cb ->
    if cb (Bytes.empty, 0, 0) then close_proxy ()
) should_close

| None ->
ignore (CBuf.read b' b)
) ;
( match should_close with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is to handle the case where polly returned after a timeout? Could it then go in an else branch?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the code to call the callback only once by providing a init value(Bytes.empty, 0, 0) to Polly.wait_fold

Comment on lines 203 to 206
if !Xapi_globs.poll_timeout_sec < 0 then
-1
else
!Xapi_globs.poll_timeout_sec * 1000 (* convert to milliseconds *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or:

max (-1) (!Xapi_globs.poll_timeout_sec * 1000) (* convert to milliseconds *)

Copy link
Collaborator Author

@stephenchengCloud stephenchengCloud Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I changed poll_timeout_sec to a float type. The max (-1) ... can not work, say poll_timeout_sec is set to -0.00001 (though it's unlikely to set like this).
So I changed the logic back to the original one.

let get_console_idle_timeout_value ~__context ~vm =
try
let idle_timeout =
if Db.VM.get_is_control_domain ~__context ~self:vm then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this triggers several DB calls every 5 seconds for every console. If the VM is running on a pool member, then that means several calls to the coordinator. I understand that we want to be responsive to timeout changes, with we should reduce this to at most one DB call.
Several of these values are static for given VM and can be captured in the closure before the line with fun (buf, read_len, offset) -> below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Good point.

Db.Pool.get_vm_console_idle_timeout ~__context ~self:pool
in
idle_timeout
with _ -> 0L (* Default: no timeout *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it an option please.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

false
| Some parser -> (
try
let data = Bytes.sub_string buf offset read_len in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this still creates a copy. I had hoped that we could avoid that and pass buf+offset+len straight to the parser. Perhaps not worth doing, but I can't remember how the parser processes this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser calls Angstrom.Buffered.feed which needs a string type.
So there must be a copy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, not even with the unsafe functions the copy can be avoided, as Bytes.subfamily of functions always allocates

Int64.to_float (get_console_idle_timeout_value ~__context ~vm)
in
if is_idle_timeout ~idle_timeout_seconds ~last_activity then (
debug "Idle timeout exceeded! (timeout: %.1fs)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be handy to say that this is about a console connection and for which VM.

exception Close_proxy

let proxy ?(should_close : (bytes * int * int -> bool) option = None)
?(poll_timeout = -1) (a : Unix.file_descr) (b : Unix.file_descr) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This should be a parameter with a default but it should not be an optional value.

(* The timeout (in seconds) for event polling in the proxy loop.
If set to a positive value, the poll will wake up periodically,
which is useful for implementing features like idle timeout or periodic inspection of proxy buffers. *)
let poll_timeout_sec = ref 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected value at the Polly API is an int in milliseconds - so we could take a float in seconds and convert it to milliseconds when we pass it to Polly.

Comment on lines 1791 to 1793
; ( "poll_timeout_sec"
, Arg.Set_int poll_timeout_sec
, (fun () -> string_of_int !poll_timeout_sec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be defined as part of the xapi_globs_spec_with_descriptions list, in line 1259:

  ; ( "proxy_poll_period_timeout"
    , ShortDurationFromSeconds proxy_poll_period_timeout
    , "Timeout (in seconds) for event polling in network proxy loops. When \
       positive, the proxy will wake up periodically to check tasks like vnc \
       idle timeouts or perform other maintenance tasks. Set to -1 to wait \
       indefinitely for network events without periodic wake-ups."
    )

use ShortDurationFromSeconds to have the conf value as float, or LongDurationFromSeconds for int. Because polly accepts milliseconds and the polling loop is bound to timeout in a matter of seconds at most, I recommend the float.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I didn't know this before. But I checked the code, it seems ShortDurationFromSeconds cannot be negative, while epoll can take -1 as input. So I'll use float as Lindig suggested, and it'll be simpler to covert it to milliseconds.

val s_to_span : float -> Mtime.Span.t option
(** [s_to_span seconds] converts a float representing seconds to a timespan.
    Returns None when [seconds] is negative, is not a number or larger than
    ~104 days. Avoid whenever possible, some RPC function already use this so
    it needs to be available. *)

Stephen Cheng added 2 commits October 9, 2025 19:28
fix
Signed-off-by: Stephen Cheng <[email protected]>
Signed-off-by: Stephen Cheng <[email protected]>
last_activity := Mtime_clock.counter () ;

(* The idle_timeout config may change during running the callback,
so need to get the latest *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the callback function is performance critical. We may have to give up on this. Specifically, I think we need to avoid any DB operations within this callback.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. It's a design choice. In that way, the timeout config change only applys to the new connections.
I can add this to the design doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need it to apply to current connections, then another option is to have a separate thread watch the DB field using event.from and set a local ref that all console proxies can use. Then DB calls are outside of the critical path. But even simpler to not do it at all...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, better to not do it.
If users want the idle timeout config applys to old connections, it's easy to reconnect the connections.

match parser data with
| Ok messages ->
if is_active messages then
last_activity := Mtime_clock.counter () ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unlikely that it would get timed out after this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. We should not call timed_out function if is_active is true.

debug "RFB parse error: %s" error_msg ;
rfb_parser := None ;
false
with e ->
Copy link
Member

@minglumlu minglumlu Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the parsing raise any exception? Will the result return have covered all the error cases? It is the point of result.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the parser code, it'll return Error when parsing FAIL.
I'll remove the exception.

let pool = Helpers.get_pool ~__context in
Db.Pool.get_vm_console_idle_timeout ~__context ~self:pool
in
Some idle_timeout
Copy link
Member

@minglumlu minglumlu Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be

if idle_timeout > 0. then Some (Int64.to_float idle_timeout) else None

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and tested

messages

let timed_out ~idle_timeout_seconds ~last_activity =
idle_timeout_seconds > 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checking is not needed anymore.

Signed-off-by: Stephen Cheng <[email protected]>
@stephenchengCloud stephenchengCloud force-pushed the private/stephenche/CP-54393 branch from 0e57657 to b238556 Compare October 10, 2025 11:23
@minglumlu
Copy link
Member

Hello, @stephenchengCloud
Could you please have a test for the performance? For example, for the case that several consoles keep refreshing the display, like scrolling the screen.
It would be good to compare the performance between with and without the change.

@stephenchengCloud
Copy link
Collaborator Author

Hello, @stephenchengCloud Could you please have a test for the performance? For example, for the case that several consoles keep refreshing the display, like scrolling the screen. It would be good to compare the performance between with and without the change.

Will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants