Skip to content

Conversation

@Firefishy
Copy link
Contributor

@Firefishy Firefishy commented Nov 10, 2025

Change summary

Add vyos support for having the hardware watchdog managed by systemd. Have systemd also use the hardware watchdog to control shutdown and reboot timeouts.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes)
  • Migration from an old Vyatta component to vyos-1x, please link to related PR inside obsoleted component
  • Other (please describe):

Related Task(s)

https://vyos.dev/T7101

Related PR(s)

vyos/vyos-documentation#1703 (feature documentation)

How to test / Smoketest result

Checklist:

  • I have read the CONTRIBUTING document
  • I have linked this PR to one or more Phabricator Task(s)
  • I have run the components SMOKETESTS if applicable
  • My commit headlines contain a valid Task id
  • My change requires a change to the documentation
  • I have updated the documentation accordingly

@github-actions
Copy link

github-actions bot commented Nov 10, 2025

👍
No issues in PR Title / Commit Title

@Firefishy Firefishy changed the title T7101 T7101: Add hardware watchdog support via systemd Nov 10, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds hardware watchdog support to VyOS, allowing users to configure systemd's runtime watchdog functionality to monitor system health and trigger automatic reboots on hang/freeze conditions.

Key changes:

  • New Python configuration module for hardware watchdog with configurable timeouts
  • XML interface definition for CLI commands to enable/disable watchdog and set parameters
  • Jinja2 template for systemd watchdog configuration

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/conf_mode/system_watchdog.py Core configuration logic for watchdog including validation, module loading, and systemd configuration
interface-definitions/system_watchdog.xml.in CLI interface definition with timeout parameters and module specification
data/templates/system/watchdog.conf.j2 Systemd configuration template for watchdog runtime settings

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Firefishy Firefishy force-pushed the T7101 branch 3 times, most recently from b066fe4 to 8796ec3 Compare November 10, 2025 21:18
@Firefishy
Copy link
Contributor Author

Added a basic smoketest for the watchdog: smoketest/scripts/cli/test_system_watchdog.py

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot finished reviewing on behalf of Firefishy November 11, 2025 13:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sever-sever
Copy link
Member

Binary cannot build
Try build the package dpkg-builpackage -us -uc -b in the vyos-build container

 <string>:0:0:ERROR:RELAXNGV:RELAXNG_ERR_INTEREXTRA: Extra element children in interleave
build/interface-definitions/system_watchdog.xml:3:0:ERROR:RELAXNGV:RELAXNG_ERR_CONTENTVALID: Element node failed to validate content
Interface definition file build/interface-definitions/system_watchdog.xml does not match the schema!
make[2]: *** [Makefile:32: interface_definitions] Error 1
make[2]: Leaving directory '/__w/vyos-1x/vyos-1x/packages/vyos-1x'
make[1]: *** [debian/rules:49: override_dh_auto_build] Error 2
make[1]: Leaving directory '/__w/vyos-1x/vyos-1x/packages/vyos-1x'
make: *** [debian/rules:39: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2

@Firefishy Firefishy force-pushed the T7101 branch 2 times, most recently from 4dfe354 to abf012c Compare November 11, 2025 14:23
@Firefishy
Copy link
Contributor Author

The failed smoktest looks like a false positive:

DEBUG - test_bgp_99_bmp (__main__.TestProtocolsBGP.test_bgp_99_bmp) ... FAIL
DEBUG - 
DEBUG - ======================================================================
DEBUG - FAIL: test_bgp_99_bmp (__main__.TestProtocolsBGP.test_bgp_99_bmp)
DEBUG - ----------------------------------------------------------------------
DEBUG - Traceback (most recent call last):
DEBUG -   File "/usr/libexec/vyos/tests/smoke/cli/test_protocols_bgp.py", line 231, in tearDown
DEBUG -     super().tearDown()
DEBUG -   File "/usr/libexec/vyos/tests/smoke/cli/base_vyostest_shim.py", line 83, in tearDown
DEBUG -     self.assertEqual(self.mgmt_daemon_pid, process_named_running(mgmt_daemon))
DEBUG - AssertionError: 1664 != 98762
DEBUG - 
DEBUG - ----------------------------------------------------------------------
DEBUG - Ran 30 tests in 192.167s
DEBUG - 
DEBUG - FAILED (failures=1)

@Firefishy
Copy link
Contributor Author

I've made all the requested changes. Let me know if anything more is required.

A minor addition if desired might be to allow specify which /dev/watchdogX device to use. There can be multiple watchdog devices on some systems and it can sometimes be preferential to use a specific device.

@Firefishy Firefishy requested a review from Copilot November 12, 2025 17:37
Copilot finished reviewing on behalf of Firefishy November 12, 2025 17:38
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +88 to +92
try:
modules_load_directory.mkdir(exist_ok=True)
modules_load_file.write_text(f"{module}\n")
except Exception as e:
print(f"Warning: Failed writing modules-load configuration: {e}")
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bare Exception in except clauses is too broad and could mask unexpected errors. Consider catching more specific exceptions (e.g., OSError, IOError) or letting unexpected exceptions propagate to be caught by the airbag error handler.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this to VyOS team to decide.

Comment on lines +101 to +104
try:
call(f'modprobe {module}')
except Exception as e:
print(f"Warning: Could not load watchdog module '{module}': {e}")
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bare Exception in except clauses is too broad and could mask unexpected errors. Consider catching more specific exceptions related to process execution (e.g., OSError) or letting unexpected exceptions propagate to be caught by the airbag error handler.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this to VyOS team to decide.

@github-actions
Copy link

CI integration ❌ failed!

Details

CI logs

  • CLI Smoketests (no interfaces) 👍 passed
  • CLI Smoketests VPP 👍 passed
  • CLI Smoketests (interfaces only) 👍 passed
  • Config tests 👍 passed
  • Config tests VPP 👍 passed
  • RAID1 tests ❌ failed
  • TPM tests 👍 passed

Copy link
Member

@dmbaturin dmbaturin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea and the CLI seems fine to me. I left a couple of questions/suggestions that I'd like to address before we merge it, in addition to concerns people already pointed out.

<help>Kernel module to load for watchdog device (optional)</help>
<valueHelp>
<format>txt</format>
<description>Module name (e.g. 'softdog', 'iTCO_wdt', 'sp5100_tco')</description>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally prefer set-time validation over commit-time validation whenever possible, since it provides a faster feedback and the user doesn't have to commit to discover that some options are wrong. Two questions from this perspective:

  1. How likely is it that the module for the hardware watchdog actually present in the system will not be loaded? Can we fully automate that and either load the module for the watchdog that the system has or fail the commit if there's no module matching the watchdog?
  2. If we can't fully automate it, can we add an explicit list of available watchdog module for set-time validation?

</leafNode>
<leafNode name="timeout">
<properties>
<help>Watchdog timeout for runtime in seconds (1-86400)</help>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 86400 the upper limit for those timeouts in the kernel or systemd? If it's not, I believe we shouldn't introduce an artificial limit — one person's constant is another person's variable, as Alan Perlis quipped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max value of watchdog timeout varies per watchdog. Eg: bcm2835_wdt has a max timeout of 16 seconds. The iTCO_wdt module on v1 hardware is around 30 seconds and as high as 10mins on V2 hardware.

I will investigate further, it is likely that the max value can be read from some modules once it has been loaded.

Copy link
Member

@sarthurdev sarthurdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for quickly addressing the requested changes.

I think we just need to figure out how to handle the module validation, XML-based is always the preferred method. But we also should validate the specified module is a watchdog driver.

Comment on lines +28 to +31
watchdog_config_dir = Path('/run/systemd/system.conf.d')
watchdog_config_file = Path(watchdog_config_dir / 'watchdog.conf')
modules_load_directory = Path('/run/modules-load.d')
modules_load_file = Path(modules_load_directory / 'watchdog.conf')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
watchdog_config_dir = Path('/run/systemd/system.conf.d')
watchdog_config_file = Path(watchdog_config_dir / 'watchdog.conf')
modules_load_directory = Path('/run/modules-load.d')
modules_load_file = Path(modules_load_directory / 'watchdog.conf')
watchdog_config_dir = Path('/run/systemd/system.conf.d')
watchdog_config_file = watchdog_config_dir / 'watchdog.conf'
modules_load_directory = Path('/run/modules-load.d')
modules_load_file = modules_load_directory / 'watchdog.conf'

Don't need to encapsulate the file paths in another Path()

# If a module is provided, verify it resolves via modprobe (dry-run)
if module:
# Dry-run modprobe (-n) in quiet mode (-q) verifies availability without loading
rc = run(f'modprobe -n -q {module}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only issue here is that it doesn't verify the specified module is a valid watchdog driver.

As per @dmbaturin's comment, probably best to build a validator or list of watchdog modules enabled in the VyOS kernel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

5 participants