Description
Recently, the following libraries test jobs in the SuperPMI collection pipeline have been timing out:
Libraries SuperPMI collection libraries_tests Checked coreclr osx arm64 Release
Libraries SuperPMI collection libraries_tests_no_tiered_compilation Checked coreclr osx arm64 Release
with the message in the console log:
Waiting for completion of job 362925e1-39d3-483a-a3a1-b4684a66a5d6 on osx.1200.arm64
##[error]The Operation will be canceled. The next steps may not contain expected logs.
##[error]The operation was canceled.
This started with the 20250403.1 run, which @AndyAyersMS triggered manually after #114191 increased the inlining budget.
Before this, these libraries test jobs also failed frequently, but not as consistently with what looked like a timeout.
The oldest failing run whose logs we still have is 20250216.1, and it also shows a timeout.
Based on information from our internal CI engineering team, 10 Mac machines service the osx.1200.arm64 Helix queue, which this pipeline uses to run its osx-arm64 jobs. The pipeline creates a huge number of osx-arm64 tasks, and those tasks now seem to take longer to run. As a result, tasks sit in the queue waiting for a machine to become available, and the job times out before the worker machines can pick up all of the tasks.
One solution is simply to increase the timeout. Another is for the engineering team to add more osx-arm64 machines. Some time ago, the public queues switched from osx.1200.arm64.open to osx.13.arm64.open (#112647). We could make the same change in the internal pipeline and switch to osx.13.arm64; however, that queue currently has only 8 machines.
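As a rough illustration of why the job runs out of time, the queue can be modelled as a fixed pool of machines draining a list of tasks. This is a simplified sketch, not how Helix actually schedules work: it assumes each task is picked up in submission order by whichever machine frees up first, and it ignores machine setup and upload time. The function names, task count, per-task duration, and timeout below are all hypothetical, not measurements from this pipeline.

```python
import heapq
from typing import Sequence


def makespan_minutes(task_minutes: Sequence[float], machines: int) -> float:
    """Hypothetical queue model: each task, in submission order, runs on the
    machine that becomes free first. Returns when the last task finishes."""
    if machines <= 0:
        raise ValueError("machines must be positive")
    free_at = [0.0] * machines  # time at which each machine is next idle
    heapq.heapify(free_at)
    for duration in task_minutes:
        start = heapq.heappop(free_at)  # earliest-available machine
        heapq.heappush(free_at, start + duration)
    return max(free_at)


def fits_in_timeout(task_minutes: Sequence[float], machines: int,
                    timeout_minutes: float) -> bool:
    """True if all tasks finish before the (hypothetical) job timeout."""
    return makespan_minutes(task_minutes, machines) <= timeout_minutes


if __name__ == "__main__":
    # Hypothetical: 200 tasks, a 150-minute timeout, and per-task time
    # growing from 6 to 8 minutes.
    for per_task in (6, 8):
        for pool in (10, 8):
            tasks = [per_task] * 200
            print(per_task, pool, makespan_minutes(tasks, pool),
                  fits_in_timeout(tasks, pool, 150))
```

Under this model, wall time grows roughly as (number of tasks × average task time) / machines, so both longer-running tasks and a smaller pool (8 machines instead of 10) push the job past a fixed timeout, while raising the timeout or adding machines pulls it back under.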
@dotnet/jit-contrib