Releases: pytorch/test-infra

v20250929-182904

29 Sep 18:30
73efcae

[autorevert] fix RetryWithBackoff, add tests (#7243)

A follow-up to https://github.com/pytorch/test-infra/pull/7241

Fixes the logic and adds unit tests.

v20250929-161641

29 Sep 16:18
aa5c240

[AUTOREVERT] use secret store over environment variables for password…

v20250929-155929

29 Sep 16:01
6ec1bf7

[AUTOREVERT] Add retry with back-off for GH API and CH (#7241)

Goes through the code, finds where we call external APIs, and wraps each
call in a retry with exponential back-off.

Defaults to 5 retries, a 0.5s base delay, and 10% jitter.

There are NO CODE CHANGES; all relevant parts of the code are
guarded with:

```
for attempt in RetryWithBackoff():
    with attempt:
        # the code 
```

The diff appears large due to:

* Extra indentation and the consequent linter changes
* The lazy nature of the gh and ch libraries, which resolve pagination as the
code consumes information
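
For readers unfamiliar with the `for attempt in ...: with attempt:` pattern, here is a minimal sketch of how such a helper could work. It is not the implementation from the PR: the constructor parameters, the `_Attempt` class, and the reading of "5 retries" as five retries after the first try are all assumptions made for illustration.

```python
import random
import time


class _Attempt:
    """Context manager for one try (hypothetical helper, not from the PR)."""

    def __init__(self, owner, is_last):
        self._owner = owner
        self._is_last = is_last

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._owner.done = True  # success: tell the iterator to stop
            return False
        if self._is_last or not issubclass(exc_type, Exception):
            return False  # out of retries (or e.g. KeyboardInterrupt): re-raise
        return True  # suppress the error so the for-loop tries again


class RetryWithBackoff:
    """Yields attempts, sleeping base_delay * 2**n (with jitter) between them."""

    def __init__(self, max_retries=5, base_delay=0.5, jitter=0.1, sleep=time.sleep):
        self.max_retries = max_retries  # assumed: retries after the first try
        self.base_delay = base_delay
        self.jitter = jitter  # 0.1 means +/- 10%
        self.sleep = sleep  # injectable so tests do not really sleep
        self.done = False

    def __iter__(self):
        self.done = False
        for n in range(self.max_retries + 1):
            if n > 0:
                delay = self.base_delay * 2 ** (n - 1)
                delay *= 1 + random.uniform(-self.jitter, self.jitter)
                self.sleep(delay)
            yield _Attempt(self, is_last=(n == self.max_retries))
            if self.done:
                return
```

The context manager swallows the exception on every try except the last, which is what lets the wrapped code stay unchanged apart from the extra indentation.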

v20250929-124114

29 Sep 12:42
06985bf

[autorevert] fix handling for insufficient successes (#7235)

Previously the code was trying to group branches for restarts resulting
from "infra check" and from "insufficient events", and this was a
mistake, resulting in delayed restarts.

Specifically, in this situation:
<img width="999" height="747" alt="image"
src="https://github.com/user-attachments/assets/9cd0051e-8d87-4fe2-af90-88a776847c4d"
/>
a restart on the success side is expected, but the system waits for
a pending job on the failure side.


This PR decouples and simplifies the logic. Now, all restarts are
scheduled independently (relying on set deduplication) and all final
checks are performed afterwards.

Added a unit test to specifically verify the case above.
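
As a rough illustration of "scheduled independently, relying on set deduplication", the idea can be sketched as follows. The `(workflow, commit)` tuple shape and the `collect_restarts` name are assumptions for the example, not the PR's actual types.

```python
from typing import Iterable, Set, Tuple

# Hypothetical restart key: (workflow name, commit sha).
Restart = Tuple[str, str]


def collect_restarts(*sources: Iterable[Restart]) -> Set[Restart]:
    """Merge restart requests from independent checks; the set drops duplicates."""
    restarts: Set[Restart] = set()
    for source in sources:
        restarts.update(source)
    return restarts
```

Each check (for example "infra check" and "insufficient events") can then emit its own requests without knowing about the other, and a restart requested by both is dispatched only once.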

v20250926-200342

26 Sep 20:05
d2b0c00

[AUTOREVERT] Checks label `autorevert: disable` and notify when not r…

v20250926-174226

26 Sep 17:44
1b83d3e

[autorevert] improve restart logic with pacing, cap, and backoff (#7226)

Changes:

- `workflow_checker.restart_workflow` now always dispatches and returns
`None`
- deduplication on `restart_workflow` removed, as we can dispatch > 2
events total per commit (e.g. when covering gaps)
- new restart gating logic based on CH event history (per commit & wf,
only non-dry-run events):
  - Pacing: skip the restart if there was a successful restart within 20m of now
  - Cap: skip if total restarts (successful & failed) >= 5
  - Backoff: if recent restarts were failures, wait 20m, 40m, 60m (max),
depending on the failure streak size
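
The three gating rules can be sketched roughly as below. This is only an illustration under stated assumptions: the `RestartEvent` type, the `skip_reason` function, the order in which the rules are checked, and measuring the backoff wait from the most recent failed restart are not taken from the PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

PACING = timedelta(minutes=20)
MAX_RESTARTS = 5
BACKOFF_STEP = timedelta(minutes=20)
BACKOFF_MAX = timedelta(minutes=60)


@dataclass
class RestartEvent:
    """One non-dry-run restart for a given commit and workflow (hypothetical)."""
    ts: datetime
    success: bool


def skip_reason(events: List[RestartEvent], now: datetime) -> Optional[str]:
    """Return why a restart should be skipped, or None if it may proceed."""
    events = sorted(events, key=lambda e: e.ts)
    # Cap: successful and failed restarts both count.
    if len(events) >= MAX_RESTARTS:
        return "cap"
    # Pacing: a successful restart happened recently.
    if any(e.success and now - e.ts < PACING for e in events):
        return "pacing"
    # Backoff: count the trailing run of failures, wait 20m per failure, max 60m.
    streak = 0
    for e in reversed(events):
        if e.success:
            break
        streak += 1
    if streak:
        wait = min(BACKOFF_STEP * streak, BACKOFF_MAX)
        if now - events[-1].ts < wait:
            return "backoff"
    return None
```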

v20250926-151930

26 Sep 15:21
9f9d729

[AUTOREVERT] Remove unused files (#7227)

Removes some unused files that can't be reached from `__main__`.

v20250926-132458

26 Sep 13:26
9489aad

[autorevert] update failure threshold to 3 for autorevert eligibility…

v20250925-235116

25 Sep 23:53
742f25f

[AUTOREVERT] Adds circuit breaker with issue in pytorch/pytorch 'ci: …

v20250925-190654

25 Sep 19:08
9b326c7

opensearch/search similar failures: setup for using ttl (#7222)

For context, see https://github.com/pytorch/test-infra/issues/7221

This lets the search query multiple indices, and new records are
inserted into an index whose name is based on the month.

We can then delete indices once they get too old (I think this is
going to be done in the UI? I'm not sure whether this is in terraform).
I am also manually deleting records > 1 year old.

We could also do something with rollovers and aliases, but I think
this is more convenient.
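
A minimal sketch of the month-based index scheme, using plain string handling rather than the OpenSearch client. The `<prefix>-YYYY-MM` naming, the wildcard search pattern, and all function names here are assumptions for illustration; the PR's actual index names may differ.

```python
from datetime import date
from typing import List


def monthly_index(prefix: str, day: date) -> str:
    """Index that a record inserted on `day` would go to, e.g. prefix-2025-09."""
    return f"{prefix}-{day:%Y-%m}"


def search_pattern(prefix: str) -> str:
    """Wildcard pattern so a single search covers every monthly index."""
    return f"{prefix}-*"


def expired_indices(names: List[str], prefix: str, today: date, keep_months: int) -> List[str]:
    """Return the monthly indices more than `keep_months` months older than `today`."""
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in names:
        year, month = name[len(prefix) + 1:].split("-")
        if int(year) * 12 + (int(month) - 1) < cutoff:
            expired.append(name)
    return expired
```

With this layout, expiring old data means dropping whole indices instead of deleting individual records.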

Testing:
Checked that the similar failure search still works, but that's it.