-
Couldn't load subscription status.
- Fork 5
ci: add features to assist in troubleshooting release test runs #879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
d00a2f8 to
6564811
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request adds a new "pause on fail" feature for debugging VLAB release tests. The feature allows pausing test execution when a failure occurs, with different behavior for interactive vs CI environments.
- Adds a
PauseOnFailboolean option to VLAB configuration and CLI - Implements smart pause behavior: interactive prompt for terminal sessions, automatic 60-minute pause for CI
- Updates GitHub Actions workflows to support the new pause-on-fail functionality
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/hhfab/vlabrunner.go | Adds PauseOnFail field to VLABRunOpts struct and passes it to release test configuration |
| pkg/hhfab/release.go | Implements pauseOnFail function with terminal detection and different pause behaviors |
| cmd/hhfab/main.go | Adds CLI flag for pause-on-fail option with alias "-p" |
| .github/workflows/run-vlab.yaml | Adds pause_on_fail input parameter and updates workflow to use it conditionally |
| .github/workflows/ci.yaml | Adds pause_on_fail input parameter and passes it to the run-vlab workflow |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
6564811 to
f9a9b54
Compare
1b281b8 to
b337470
Compare
3fd4c53 to
f99a064
Compare
5009b1d to
962c3ba
Compare
962c3ba to
0754bc2
Compare
0754bc2 to
fa70dfc
Compare
fa70dfc to
4ed61eb
Compare
4ed61eb to
941cf11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think great improvements to allow pause on fail, regex/inverts as flags in CLI, but we shouldn't do that for the CI, and I'm not sure about the script
| description: "Run release tests in vlab/hlab" | ||
| required: false | ||
| default: false | ||
| release_test_regex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't allow this in CI, it'll be very deceptive as when you start release tests you expect all to run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I barely use it but I added it to eventually target a specific flaky test on CI. It is tested and works.
The problem with deceptiveness is more a lack of visibility of what we are actually running (unless you dig into the logs) as the current summary just gives you the counts:
| description: "Run only release tests matched by regex" | ||
| required: false | ||
| default: "" | ||
| release_test_invert_regex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
| description: "Run the build with tmate debugging enabled (https://github.com/marketplace/actions/debugging-with-tmate)" | ||
| required: false | ||
| default: false | ||
| pause_on_fail: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't pause on failure especially on hlab as it's a very expensive resource. If debug is required, I suggest to disable worker and run bunch of e.g. release tests manually to have full control as debugging through runner is anyways always limited by timeouts and impossible to restart things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can exclude hlab or add an option in case you want to target a flaky test hitting there. As it's an expensive resource being hit continuously by flaky tests is also a waste of CI time, thus the whole idea of this PR, to be able to hunt issues actively. This proved useful in the TPCM Sonic VS issue. It is how I could nail it down.
By default no pipeline will be paused unless you trigger it specifically to. If you cordon the runner you will end up affecting more jobs than with this PR approach, IMHO
| needs: | ||
| - test-build | ||
|
|
||
| name: "${{ matrix.hybrid && 'h' || 'v' }}-${{ matrix.upgradefrom && 'up' || '' }}${{ matrix.upgradefrom }}${{ matrix.upgradefrom && '-' || '' }}${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ matrix.mesh && 'mesh-' || '' }}${{ matrix.gateway && 'gw-' || '' }}${{ matrix.includeonie && 'onie-' || '' }}${{ matrix.buildmode }}-${{ matrix.vpcmode }}${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to have -rt in a job name as it represents a topo and deployment type, not how it's tested. Detailed flags of what is running within it is visible in the beginning of the job already
| env: | ||
| # global workflow configs | ||
| slug: "${{ inputs.hybrid && 'h' || 'v' }}-${{ inputs.upgradefrom }}${{ inputs.upgradefrom && '-' || '' }}${{ inputs.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ inputs.mesh && 'mesh-' || '' }}${{ inputs.gateway && 'gw-' || '' }}${{ inputs.includeonie && 'onie-' || '' }}${{ inputs.buildmode }}-${{ inputs.vpcmode }}" | ||
| slug: "${{ inputs.hybrid && 'h' || 'v' }}-${{ inputs.upgradefrom }}${{ inputs.upgradefrom && '-' || '' }}${{ inputs.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ inputs.mesh && 'mesh-' || '' }}${{ inputs.gateway && 'gw-' || '' }}${{ inputs.includeonie && 'onie-' || '' }}${{ inputs.buildmode }}-${{ inputs.vpcmode }}${{ inputs.releasetest && '-rt' || '' }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
hack/monitor-ci-paused-jobs.sh
Outdated
| @@ -0,0 +1,232 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks like quite a big and generic script to maintain in the fabricator repo. You can put it in some lab/internal repo if you think it makes sense to have in git somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was doubting to add it here or lab repo, where I have a related script to jump into the runners. I was proposing here because it could be handy for anyone to get a compact overview of running Ci workflows without having to drill down into the GHA Web UI:
⏳ Workflow: CI (87m) - Branch: pau/ci_reltest_pause 🟡 Workflow: CI (93m) - Branch: dev/frostman/stability
✅ v-mesh-iso-l3vni (Runner: 'vlab-6bggx-runner-bgqj2') ✅ v-gw-onie-usb-l2vni (Runner: 'vlab-6bggx-runner-ww25s')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770734 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492869
🟡 v-up25.04-iso-l3vni (Runner: 'vlab-6bggx-runner-9xptf') (47m) ✅ v-manual-l2vni (Runner: 'vlab-6bggx-runner-mc8sn')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770738 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492878
✅ v-iso-l3vni (Runner: 'vlab-6bggx-runner-2pnfz') 🟡 v-iso-l3vni (Runner: 'vlab-6bggx-runner-v5v5d') (40m)
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770740 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492879
🟡 v-mesh-iso-l2vni (Runner: 'vlab-6bggx-runner-rj4bx') (34m) ✅ v-gw-onie-iso-l2vni (Runner: 'vlab-6bggx-runner-xc272')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770764 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492882
✅ v-gw-iso-l3vni (Runner: 'vlab-6bggx-runner-gz6km') ✅ v-gw-iso-l3vni (Runner: 'vlab-6bggx-runner-wlssw')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770765 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492886
⏳ h-iso-l2vni 🟡 h-iso-l2vni (Runner: 'hlab-gbvbd-runner-4djxz') (85m)
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770767 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492898
✅ v-gw-onie-usb-l2vni (Runner: 'vlab-6bggx-runner-p29gz') ✅ v-up25.04-iso-l3vni (Runner: 'vlab-6bggx-runner-9jjl6')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770770 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492902
🟡 v-usb-l2vni (Runner: 'vlab-6bggx-runner-cnssw') (29m) ✅ v-usb-l2vni (Runner: 'vlab-6bggx-runner-qbpgc')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770781 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492922
✅ v-manual-l2vni (Runner: 'vlab-6bggx-runner-8c4nd') ✅ v-cc-iso-l2vni (Runner: 'vlab-6bggx-runner-67jds')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770785 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492929
✅ v-cc-iso-l2vni (Runner: 'vlab-6bggx-runner-t4mj9') ✅ v-mesh-iso-l2vni (Runner: 'vlab-6bggx-runner-rlfhk')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770787 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365492949
✅ v-iso-l2vni (Runner: 'vlab-6bggx-runner-v6dkl') ✅ v-mesh-iso-l3vni (Runner: 'vlab-6bggx-runner-wxp6t')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770788 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365493006
🟡 v-gw-onie-iso-l2vni (Runner: 'vlab-6bggx-runner-kvh42') (26m) ✅ v-iso-l2vni (Runner: 'vlab-6bggx-runner-knwth')
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770806 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365493094
🟡 v-up25.04-iso-l2vni (Runner: 'vlab-6bggx-runner-4lvjq') (33m) 🟡 v-up25.04-iso-l2vni (Runner: 'vlab-6bggx-runner-8jvng') (68m)
→ https://github.com/githedgehog/fabricator/actions/runs/17725549258/job/50365770807 → https://github.com/githedgehog/fabricator/actions/runs/17725413073/job/50365493112
=== Updated: 11:00:25 | ⏳ Queued 🟡 Running ✅ Success ❌ Failed 🚫 Cancelled 🚨 Paused ===
This will save me hours of browsing the UI. I will move it to the lab repo
pkg/hhfab/show-tech/switch.sh
Outdated
| echo "=== System Information and Status ===" | ||
| run_sonic_cmd "show version" | ||
| run_sonic_cmd "show uptime" | ||
| run_sonic_cmd "show system status brief" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to run status detail as brief would be limited to the system readiness summary
| var input string | ||
| _, _ = fmt.Scanln(&input) | ||
| slog.Info("Continuing...") | ||
| if !term.IsTerminal(int(os.Stdin.Fd())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't make CI pause, especially for 60 minutes. It'll be still almost impossible to catch as it's a very short window of opportunity plus what I said in a first comment
848f049 to
b1237b0
Compare
Test Results 12 files 48 suites 2h 59m 39s ⏱️ Results for commit 3fb39b5. ♻️ This comment has been updated with latest results. |
Signed-off-by: Pau Capdevila <[email protected]>
Signed-off-by: Pau Capdevila <[email protected]>
Signed-off-by: Pau Capdevila <[email protected]>
Signed-off-by: Pau Capdevila <[email protected]>
Signed-off-by: Pau Capdevila <[email protected]>
b1237b0 to
3fb39b5
Compare
|
@Frostman , please comment if you find this useful (I may keep rebasing the branch for personal use when in need for debugging) or not and if shall i remove the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| func pauseOnFail() { | ||
| // pause until the user presses enter | ||
| slog.Warn("Test failed, pausing execution. Note that reverts might still need to apply, so if you intend to continue, please make sure to leave the environment in the same state as you found it") | ||
| slog.Info("Press enter to continue...") | ||
| var input string | ||
| _, _ = fmt.Scanln(&input) | ||
| slog.Info("Continuing...") | ||
| if !term.IsTerminal(int(os.Stdin.Fd())) { |
Copilot
AI
Sep 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding error handling for os.Stdin.Fd() as it can fail in some environments where stdin is not available or accessible.
| func pauseOnFail() { | |
| // pause until the user presses enter | |
| slog.Warn("Test failed, pausing execution. Note that reverts might still need to apply, so if you intend to continue, please make sure to leave the environment in the same state as you found it") | |
| slog.Info("Press enter to continue...") | |
| var input string | |
| _, _ = fmt.Scanln(&input) | |
| slog.Info("Continuing...") | |
| if !term.IsTerminal(int(os.Stdin.Fd())) { | |
| func isStdinTerminal() bool { | |
| defer func() { | |
| if r := recover(); r != nil { | |
| // If Fd() panics, treat as non-terminal | |
| } | |
| }() | |
| return term.IsTerminal(int(os.Stdin.Fd())) | |
| } | |
| func pauseOnFail() { | |
| if !isStdinTerminal() { |
| slog.Info("Continuing...") | ||
| if !term.IsTerminal(int(os.Stdin.Fd())) { | ||
| // CI environment - pause for a long time to allow debugging | ||
| pauseDuration := 60 * time.Minute |
Copilot
AI
Sep 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded 60-minute pause duration should be configurable or at least defined as a constant to make it easier to adjust for different debugging scenarios.
| name: "${{ matrix.hybrid && 'h' || 'v' }}-${{ matrix.upgradefrom && 'up' || '' }}${{ matrix.upgradefrom }}${{ matrix.upgradefrom && '-' || '' }}${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ matrix.mesh && 'mesh-' || '' }}${{ matrix.gateway && 'gw-' || '' }}${{ matrix.includeonie && 'onie-' || '' }}${{ matrix.buildmode }}-${{ matrix.vpcmode }}${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }}" | ||
|
|
Copilot
AI
Sep 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This job name expression is extremely long and complex. Consider breaking it into multiple variables or using a more readable format to improve maintainability.
| name: "${{ matrix.hybrid && 'h' || 'v' }}-${{ matrix.upgradefrom && 'up' || '' }}${{ matrix.upgradefrom }}${{ matrix.upgradefrom && '-' || '' }}${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }}${{ matrix.mesh && 'mesh-' || '' }}${{ matrix.gateway && 'gw-' || '' }}${{ matrix.includeonie && 'onie-' || '' }}${{ matrix.buildmode }}-${{ matrix.vpcmode }}${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }}" | |
| env: | |
| HYBRID_PREFIX: ${{ matrix.hybrid && 'h' || 'v' }} | |
| UPGRADE_PREFIX: ${{ matrix.upgradefrom && 'up' || '' }} | |
| UPGRADE_VALUE: ${{ matrix.upgradefrom }} | |
| UPGRADE_DASH: ${{ matrix.upgradefrom && '-' || '' }} | |
| FABRICMODE_PREFIX: ${{ matrix.fabricmode == 'collapsed-core' && 'cc-' || '' }} | |
| MESH_PREFIX: ${{ matrix.mesh && 'mesh-' || '' }} | |
| GW_PREFIX: ${{ matrix.gateway && 'gw-' || '' }} | |
| ONIE_PREFIX: ${{ matrix.includeonie && 'onie-' || '' }} | |
| BUILD_MODE: ${{ matrix.buildmode }} | |
| VPC_MODE: ${{ matrix.vpcmode }} | |
| RELEASE_TEST: ${{ (inputs.releasetest == true || contains(github.event.pull_request.labels.*.name, 'ci:+release')) && '-rt' || '' }} | |
| name: "${{ env.HYBRID_PREFIX }}-${{ env.UPGRADE_PREFIX }}${{ env.UPGRADE_VALUE }}${{ env.UPGRADE_DASH }}${{ env.FABRICMODE_PREFIX }}${{ env.MESH_PREFIX }}${{ env.GW_PREFIX }}${{ env.ONIE_PREFIX }}${{ env.BUILD_MODE }}-${{ env.VPC_MODE }}${{ env.RELEASE_TEST }}" |
Uh oh!
There was an error while loading. Please reload this page.