Releases: pytorch/test-infra
v20250929-182904
[autorevert] fix RetryWithBackoff, add tests (#7243)
A follow-up to https://github.com/pytorch/test-infra/pull/7241. Fixes the logic and adds unit tests.
v20250929-161641
[AUTOREVERT] use secret store over environment variables for password…
v20250929-155929
[AUTOREVERT] Add retry with back-off for GH API and CH (#7241)
This goes through the code, finds where we call external APIs, and adds a
retry with exponential back-off.
Defaults to 5 retries, a 0.5s base delay, and 10% jitter.
There are NO CODE CHANGES; all relevant parts of the code are
guardrailed with:
```
for attempt in RetryWithBackoff():
    with attempt:
        # the code
```
The changes appear big due to:
* Extra tabs and the consequent linter changes
* The lazy nature of the gh and ch libraries, which resolve pagination as the
code consumes information
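To illustrate the `for attempt in ...: with attempt:` pattern described above, here is a minimal sketch of how such a helper could work. It is not the code from the PR: the class name `RetryWithBackoffSketch`, its parameters, and the injectable `sleep` argument are hypothetical, and it assumes "5 retries" means five attempts in total.

```python
import random
import time


class _Attempt:
    """One try. Swallows an exception unless it is the final attempt."""

    def __init__(self, owner, is_last):
        self._owner = owner
        self._is_last = is_last

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._owner.succeeded = True
            return False
        # Suppress ordinary errors so the loop can retry; re-raise on the
        # last attempt and never swallow KeyboardInterrupt/SystemExit.
        return issubclass(exc_type, Exception) and not self._is_last


class RetryWithBackoffSketch:
    """Hypothetical helper: yields attempts until one succeeds."""

    def __init__(self, max_attempts=5, base_delay=0.5, jitter=0.1, sleep=time.sleep):
        self.max_attempts = max_attempts  # assumption: total attempts
        self.base_delay = base_delay      # seconds before the second attempt
        self.jitter = jitter              # 0.1 means +/- 10%
        self.succeeded = False
        self._sleep = sleep

    def __iter__(self):
        for n in range(self.max_attempts):
            yield _Attempt(self, is_last=(n == self.max_attempts - 1))
            if self.succeeded:
                return
            # Exponential delay (0.5s, 1s, 2s, ...) scaled by +/- jitter.
            delay = self.base_delay * (2 ** n)
            delay *= 1 + random.uniform(-self.jitter, self.jitter)
            self._sleep(delay)
```

Because the retry state lives in the iterator, the guarded code only gains two levels of indentation, which is consistent with the note that the diff looks larger than the behavioural change.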
v20250929-124114
[autorevert] fix handling for insufficient successes (#7235)
Previously the code tried to group branches for restarts resulting from "infra check" and from "insufficient events". This was a mistake that resulted in delayed restarts.
Specifically, in the situation shown in this screenshot (https://github.com/user-attachments/assets/9cd0051e-8d87-4fe2-af90-88a776847c4d), a restart on the success side is expected, but the system waits for a pending job on the failure side.
This PR decouples and simplifies the logic. Now, all restarts are scheduled independently (relying on set deduplication) and all final checks are performed afterwards.
Added a unit test to specifically verify the case above.
v20250926-200342
[AUTOREVERT] Checks label `autorevert: disable` and notify when not r…
v20250926-174226
[autorevert] improve restart logic with pacing, cap, and backoff (#7226)
Changes:
- workflow_checker.restart_workflow now always dispatches and returns None
- deduplication on `restart_workflow` removed, as we can dispatch > 2 events total per commit (e.g. when covering gaps)
- new restart gating logic based on CH event history (per commit & wf, only non-dry-run events):
  - Pacing: skip restart if there was a successful restart within 20m of now
  - Cap: skip if total restarts (successful & failed) >= 5
  - Backoff: if recent restarts were failures, wait 20m, 40m, 60m (max), cap based on failure streak size
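The three gating rules can be sketched as a single decision function. This is an illustrative reading of the description, not the PR's implementation: the function name `should_restart`, the `(timestamp, succeeded)` history format, and the interpretation of backoff as "20 minutes per consecutive trailing failure, capped at 60, measured from the latest failure" are all assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

PACING = timedelta(minutes=20)        # minimum gap after a successful restart
MAX_RESTARTS = 5                      # cap on total restarts (successful + failed)
BACKOFF_STEP = timedelta(minutes=20)  # wait added per consecutive failure
BACKOFF_MAX = timedelta(minutes=60)   # upper bound on the backoff wait


def should_restart(
    history: List[Tuple[datetime, bool]], now: datetime
) -> Tuple[bool, str]:
    """Decide whether to dispatch a restart for one commit and workflow.

    `history` holds (timestamp, succeeded) pairs for earlier non-dry-run
    restart events; this format is hypothetical.
    """
    events = sorted(history)

    # Cap: stop once the total number of restarts reaches the limit.
    if len(events) >= MAX_RESTARTS:
        return False, "cap"

    # Pacing: skip if a successful restart happened within the last 20 minutes.
    if any(ok and now - ts < PACING for ts, ok in events):
        return False, "pacing"

    # Backoff: count the trailing streak of failed restarts.
    streak = 0
    for _, ok in reversed(events):
        if ok:
            break
        streak += 1
    if streak:
        wait = min(BACKOFF_STEP * streak, BACKOFF_MAX)
        if now - events[-1][0] < wait:
            return False, "backoff"

    return True, "ok"
```

Returning the reason alongside the decision makes it easy to log why a restart was skipped.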
v20250926-151930
[AUTOREVERT] Remove unused files (#7227)
Just removing some unused files that can't be reached by `__main__`.
v20250926-132458
[autorevert] update failure threshold to 3 for autorevert eligibility…
v20250925-235116
[AUTOREVERT] Adds circuit breaker with issue in pytorch/pytorch 'ci: …
v20250925-190654
opensearch/search similar failures: setup for using ttl (#7222)
Some context is in https://github.com/pytorch/test-infra/issues/7221
This makes the search able to query multiple indexes, and each insertion goes to an index that is based on the month. Then we can delete the indices when they get too old (I think this is going to be done in the UI? I'm not sure if this is in terraform).
I am also manually deleting records > 1 year old.
We could also do some stuff with rollovers and aliases, but I think this is more convenient.
Testing: Checked that the similar failure search still worked, but that's it.
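The month-based index scheme can be sketched with a few helpers. The prefix `similar-failures`, the `YYYY.MM` suffix, and the function names are hypothetical and not taken from the PR; the sketch only shows how writes could be routed to a monthly index, how a wildcard pattern covers all of them for search, and how indices older than a year could be picked for deletion.

```python
from datetime import date
from typing import Iterable, List

INDEX_PREFIX = "similar-failures"  # hypothetical index name prefix


def index_for(day: date) -> str:
    """Name of the monthly index a record created on `day` is written to."""
    return f"{INDEX_PREFIX}-{day.year:04d}.{day.month:02d}"


def search_pattern() -> str:
    """Wildcard index pattern so one search covers every monthly index."""
    return f"{INDEX_PREFIX}-*"


def expired_indices(
    names: Iterable[str], today: date, max_age_months: int = 12
) -> List[str]:
    """Return the monthly indices that are more than `max_age_months` old."""
    cutoff = today.year * 12 + today.month - max_age_months
    expired = []
    for name in names:
        # Parse the "YYYY.MM" suffix that follows "<prefix>-".
        suffix = name[len(INDEX_PREFIX) + 1:]
        year, month = (int(part) for part in suffix.split("."))
        if year * 12 + month < cutoff:
            expired.append(name)
    return expired
```

Dropping a whole monthly index is cheaper than deleting individual old records, which is presumably why the month-based layout is described as more convenient than rollovers and aliases.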