Skip to content

Conversation

@pau-hedgehog
Copy link
Contributor

@pau-hedgehog pau-hedgehog commented Aug 14, 2025

  • Add pause_on_fail input parameter for debugging failed release tests
  • Add -rt suffix to release test job names for better identification
  • Add regex (and inverse flag) for release test filtering input parameter

@pau-hedgehog pau-hedgehog self-assigned this Aug 14, 2025
@pau-hedgehog pau-hedgehog force-pushed the pau/ci_reltest_pause branch 7 times, most recently from d00a2f8 to 6564811 Compare August 14, 2025 21:27
@pau-hedgehog pau-hedgehog requested a review from Copilot August 14, 2025 22:19
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request adds a new "pause on fail" feature for debugging VLAB release tests. The feature allows pausing test execution when a failure occurs, with different behavior for interactive vs CI environments.

  • Adds a PauseOnFail boolean option to VLAB configuration and CLI
  • Implements smart pause behavior: interactive prompt for terminal sessions, automatic 60-minute pause for CI
  • Updates GitHub Actions workflows to support the new pause-on-fail functionality

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/hhfab/vlabrunner.go Adds PauseOnFail field to VLABRunOpts struct and passes it to release test configuration
pkg/hhfab/release.go Implements pauseOnFail function with terminal detection and different pause behaviors
cmd/hhfab/main.go Adds CLI flag for pause-on-fail option with alias "-p"
.github/workflows/run-vlab.yaml Adds pause_on_fail input parameter and updates workflow to use it conditionally
.github/workflows/ci.yaml Adds pause_on_fail input parameter and passes it to the run-vlab workflow

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

@pau-hedgehog pau-hedgehog changed the title Pau/ci reltest pause ci: add pause-on-fail flag and split connectivity-test from release-test Aug 15, 2025
@pau-hedgehog pau-hedgehog force-pushed the pau/ci_reltest_pause branch 4 times, most recently from 1b281b8 to b337470 Compare August 15, 2025 01:37
@pau-hedgehog pau-hedgehog changed the title ci: add pause-on-fail flag and split connectivity-test from release-test ci: add pause-on-fail flag add suffix for release-test in jobs/artifacts Aug 16, 2025
@pau-hedgehog pau-hedgehog force-pushed the pau/ci_reltest_pause branch 4 times, most recently from 3fd4c53 to f99a064 Compare August 17, 2025 21:14
@pau-hedgehog pau-hedgehog force-pushed the pau/ci_reltest_pause branch 4 times, most recently from 5009b1d to 962c3ba Compare September 10, 2025 08:18
@pau-hedgehog pau-hedgehog marked this pull request as ready for review September 10, 2025 23:44
@pau-hedgehog pau-hedgehog requested review from a team as code owners September 10, 2025 23:44
@pau-hedgehog pau-hedgehog changed the title ci: add pause-on-fail flag add suffix for release-test in jobs/artifacts ci: add features to assist in troubleshooting release test runs Sep 11, 2025
@pau-hedgehog pau-hedgehog requested a review from a team as a code owner September 13, 2025 17:38
Copy link
Member

@Frostman Frostman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think great improvements to allow pause on fail, regex/inverts as flags in CLI, but we shouldn't do that for the CI, and I'm not sure about the script

description: "Run release tests in vlab/hlab"
required: false
default: false
release_test_regex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't allow this in CI, it'll be very deceptive as when you start release tests you expect all to run

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I barely use it but I added it to eventually target a specific flaky test on CI. It is tested and works.

The problem with deceptiveness is more a lack of visibility of what we are actually running (unless you dig into the logs) as the current summary just gives you the counts:

#529 (comment)

description: "Run only release tests matched by regex"
required: false
default: ""
release_test_invert_regex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

description: "Run the build with tmate debugging enabled (https://github.com/marketplace/actions/debugging-with-tmate)"
required: false
default: false
pause_on_fail:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't pause on failure especially on hlab as it's a very expensive resource. If debug is required, I suggest to disable worker and run bunch of e.g. release tests manually to have full control as debugging through runner is anyways always limited by timeouts and impossible to restart things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can exclude hlab or add an option in case you want to target a flaky test hitting there. As it's an expensive resource being hit continuously by flaky tests is also a waste of CI time, thus the whole idea of this PR, to be able to hunt issues actively. This proved useful in the TPCM Sonic VS issue. It is how I could nail it down.

By default no pipeline will be paused unless you trigger it specifically to. If you cordon the runner you will end up affecting more jobs than with this PR approach, IMHO

needs:
- test-build

name: "${{ matrix.hybrid && 'h' || 'v' }}-${{ matrix.upgradefrom && 'up' || '' }}${{ matrix.upgradefrom }}${{ matrix.upgradefrom && '-' || '' }}${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ matrix.mesh && 'mesh-' || '' }}${{ matrix.gateway && 'gw-' || '' }}${{ matrix.includeonie && 'onie-' || '' }}${{ matrix.buildmode }}-${{ matrix.vpcmode }}${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to have -rt in a job name as it represents a topo and deployment type, not how it's tested. Detailed flags of what is running within it is visible in the beginning of the job already

env:
# global workflow configs
slug: "${{ inputs.hybrid && 'h' || 'v' }}-${{ inputs.upgradefrom }}${{ inputs.upgradefrom && '-' || '' }}${{ inputs.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ inputs.mesh && 'mesh-' || '' }}${{ inputs.gateway && 'gw-' || '' }}${{ inputs.includeonie && 'onie-' || '' }}${{ inputs.buildmode }}-${{ inputs.vpcmode }}"
slug: "${{ inputs.hybrid && 'h' || 'v' }}-${{ inputs.upgradefrom }}${{ inputs.upgradefrom && '-' || '' }}${{ inputs.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ inputs.mesh && 'mesh-' || '' }}${{ inputs.gateway && 'gw-' || '' }}${{ inputs.includeonie && 'onie-' || '' }}${{ inputs.buildmode }}-${{ inputs.vpcmode }}${{ inputs.releasetest && '-rt' || '' }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@@ -0,0 +1,232 @@
#!/bin/bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like quite a big and generic script to maintain in the fabricator repo. You can put it in some lab/internal repo if you think it makes sense to have in git somewhere.

Copy link
Contributor Author

@pau-hedgehog pau-hedgehog Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was doubting to add it here or lab repo, where I have a related script to jump into the runners. I was proposing here because it could be handy for anyone to get a compact overview of running Ci workflows without having to drill down into the GHA Web UI:

⏳ Workflow: CI (87m) - Branch: pau/ci_reltest_pause                                           🟡 Workflow: CI (93m) - Branch: dev/frostman/stability
      ✅ v-mesh-iso-l3vni (Runner: 'vlab-6bggx-runner-bgqj2')                                        ✅ v-gw-onie-usb-l2vni (Runner: 'vlab-6bggx-runner-ww25s')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770734           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492869
      🟡 v-up25.04-iso-l3vni (Runner: 'vlab-6bggx-runner-9xptf') (47m)                               ✅ v-manual-l2vni (Runner: 'vlab-6bggx-runner-mc8sn')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770738           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492878
      ✅ v-iso-l3vni (Runner: 'vlab-6bggx-runner-2pnfz')                                             🟡 v-iso-l3vni (Runner: 'vlab-6bggx-runner-v5v5d') (40m)
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770740           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492879
      🟡 v-mesh-iso-l2vni (Runner: 'vlab-6bggx-runner-rj4bx') (34m)                                  ✅ v-gw-onie-iso-l2vni (Runner: 'vlab-6bggx-runner-xc272')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770764           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492882
      ✅ v-gw-iso-l3vni (Runner: 'vlab-6bggx-runner-gz6km')                                          ✅ v-gw-iso-l3vni (Runner: 'vlab-6bggx-runner-wlssw')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770765           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492886
      ⏳ h-iso-l2vni                                                                                 🟡 h-iso-l2vni (Runner: 'hlab-gbvbd-runner-4djxz') (85m)
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770767           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492898
      ✅ v-gw-onie-usb-l2vni (Runner: 'vlab-6bggx-runner-p29gz')                                     ✅ v-up25.04-iso-l3vni (Runner: 'vlab-6bggx-runner-9jjl6')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770770           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492902
      🟡 v-usb-l2vni (Runner: 'vlab-6bggx-runner-cnssw') (29m)                                       ✅ v-usb-l2vni (Runner: 'vlab-6bggx-runner-qbpgc')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770781           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492922
      ✅ v-manual-l2vni (Runner: 'vlab-6bggx-runner-8c4nd')                                          ✅ v-cc-iso-l2vni (Runner: 'vlab-6bggx-runner-67jds')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770785           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492929
      ✅ v-cc-iso-l2vni (Runner: 'vlab-6bggx-runner-t4mj9')                                          ✅ v-mesh-iso-l2vni (Runner: 'vlab-6bggx-runner-rlfhk')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770787           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492949
      ✅ v-iso-l2vni (Runner: 'vlab-6bggx-runner-v6dkl')                                             ✅ v-mesh-iso-l3vni (Runner: 'vlab-6bggx-runner-wxp6t')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770788           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365493006
      🟡 v-gw-onie-iso-l2vni (Runner: 'vlab-6bggx-runner-kvh42') (26m)                               ✅ v-iso-l2vni (Runner: 'vlab-6bggx-runner-knwth')
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770806           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365493094
      🟡 v-up25.04-iso-l2vni (Runner: 'vlab-6bggx-runner-4lvjq') (33m)                               🟡 v-up25.04-iso-l2vni (Runner: 'vlab-6bggx-runner-8jvng') (68m)
         → https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770807           → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365493112

=== Updated: 11:00:25 | ⏳ Queued 🟡 Running ✅ Success ❌ Failed 🚫 Cancelled 🚨 Paused ===

This will save me hours of browsing the UI. I will move it to the lab repo

echo "=== System Information and Status ==="
run_sonic_cmd "show version"
run_sonic_cmd "show uptime"
run_sonic_cmd "show system status brief"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to run status detail as brief would be limited to the system readiness summary

var input string
_, _ = fmt.Scanln(&input)
slog.Info("Continuing...")
if !term.IsTerminal(int(os.Stdin.Fd())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't make CI pause, especially for 60 minutes. It'll be still almost impossible to catch as it's a very short window of opportunity plus what I said in a first comment

@github-actions
Copy link

github-actions bot commented Sep 15, 2025

Test Results

 12 files  48 suites   2h 59m 39s ⏱️
 20 tests  6 ✅  14 💤 0 ❌
240 runs  63 ✅ 177 💤 0 ❌

Results for commit 3fb39b5.

♻️ This comment has been updated with latest results.

@pau-hedgehog
Copy link
Contributor Author

@Frostman , please comment if you find this useful (I may keep rebasing the branch for personal use when in need for debugging) or not and if shall i remove the rt and regex commits or not. Thanks in advance

@Frostman Frostman requested a review from Copilot September 30, 2025 20:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 2446 to +2447
func pauseOnFail() {
// pause until the user presses enter
slog.Warn("Test failed, pausing execution. Note that reverts might still need to apply, so if you intend to continue, please make sure to leave the environment in the same state as you found it")
slog.Info("Press enter to continue...")
var input string
_, _ = fmt.Scanln(&input)
slog.Info("Continuing...")
if !term.IsTerminal(int(os.Stdin.Fd())) {
Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding error handling for os.Stdin.Fd() as it can fail in some environments where stdin is not available or accessible.

Suggested change
func pauseOnFail() {
// pause until the user presses enter
slog.Warn("Test failed, pausing execution. Note that reverts might still need to apply, so if you intend to continue, please make sure to leave the environment in the same state as you found it")
slog.Info("Press enter to continue...")
var input string
_, _ = fmt.Scanln(&input)
slog.Info("Continuing...")
if !term.IsTerminal(int(os.Stdin.Fd())) {
func isStdinTerminal() bool {
defer func() {
if r := recover(); r != nil {
// If Fd() panics, treat as non-terminal
}
}()
return term.IsTerminal(int(os.Stdin.Fd()))
}
func pauseOnFail() {
if !isStdinTerminal() {

Copilot uses AI. Check for mistakes.
slog.Info("Continuing...")
if !term.IsTerminal(int(os.Stdin.Fd())) {
// CI environment - pause for a long time to allow debugging
pauseDuration := 60 * time.Minute
Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded 60-minute pause duration should be configurable or at least defined as a constant to make it easier to adjust for different debugging scenarios.

Copilot uses AI. Check for mistakes.
Comment on lines +240 to +241
name: "${{ matrix.hybrid && 'h' || 'v' }}-${{ matrix.upgradefrom && 'up' || '' }}${{ matrix.upgradefrom }}${{ matrix.upgradefrom && '-' || '' }}${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ matrix.mesh && 'mesh-' || '' }}${{ matrix.gateway && 'gw-' || '' }}${{ matrix.includeonie && 'onie-' || '' }}${{ matrix.buildmode }}-${{ matrix.vpcmode }}${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }}"

Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This job name expression is extremely long and complex. Consider breaking it into multiple variables or using a more readable format to improve maintainability.

Suggested change
name: "${{ matrix.hybrid && 'h' || 'v' }}-${{ matrix.upgradefrom && 'up' || '' }}${{ matrix.upgradefrom }}${{ matrix.upgradefrom && '-' || '' }}${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ matrix.mesh && 'mesh-' || '' }}${{ matrix.gateway && 'gw-' || '' }}${{ matrix.includeonie && 'onie-' || '' }}${{ matrix.buildmode }}-${{ matrix.vpcmode }}${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }}"
env:
HYBRID_PREFIX: ${{ matrix.hybrid && 'h' || 'v' }}
UPGRADE_PREFIX: ${{ matrix.upgradefrom && 'up' || '' }}
UPGRADE_VALUE: ${{ matrix.upgradefrom }}
UPGRADE_DASH: ${{ matrix.upgradefrom && '-' || '' }}
FABRICMODE_PREFIX: ${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }}
MESH_PREFIX: ${{ matrix.mesh && 'mesh-' || '' }}
GW_PREFIX: ${{ matrix.gateway && 'gw-' || '' }}
ONIE_PREFIX: ${{ matrix.includeonie && 'onie-' || '' }}
BUILD_MODE: ${{ matrix.buildmode }}
VPC_MODE: ${{ matrix.vpcmode }}
RELEASE_TEST: ${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }}
name: "${{ env.HYBRID_PREFIX }}-${{ env.UPGRADE_PREFIX }}${{ env.UPGRADE_VALUE }}${{ env.UPGRADE_DASH }}${{ env.FABRICMODE_PREFIX }}${{ env.MESH_PREFIX }}${{ env.GW_PREFIX }}${{ env.ONIE_PREFIX }}${{ env.BUILD_MODE }}-${{ env.VPC_MODE }}${{ env.RELEASE_TEST }}"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants