Releases: pytorch/test-infra

v20250930-222836

30 Sep 22:30
e936529
[autorevert] exclude unstable jobs (#7260)

v20250930-134331

30 Sep 13:45
99554ad
[AUTOREVERT] [BUGFIX] fixing typo in variable name preventing revert …

v20250930-125800

30 Sep 12:59
53c6bdf
[autorevert] correctly fetch and build the gaps in the signal (#7248)

1. Fixed commits-without-jobs issue

   - Problem: Commits with no workflow jobs (e.g., periodic workflow) were excluded from signal extraction
   - Solution:
     - Added `fetch_commits_in_time_range()` to query push table directly
     - Modified job query to filter by explicit list of head_shas instead of JOIN
     - Changed ORDER BY to use sha dimension first (preserves grouping, actual order doesn't matter as internally extractors now iterate over the list of commits passed explicitly)

2. Added mandatory timestamp field to SignalCommit

   - Changes:
     - `SignalCommit.__init__(head_sha, timestamp, events)` - timestamp is now mandatory
     - Signal extraction populates timestamps from push table
     - HUD state logger uses commit timestamp instead of computing from event times
     - Updated 36 test constructor calls
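
A minimal sketch of what a mandatory timestamp on the commit record could look like. Only the constructor argument names (`head_sha`, `timestamp`, `events`) come from the description above; the class name `SignalCommitSketch`, the field types, and the sample values are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class SignalCommitSketch:
    """Hypothetical stand-in for SignalCommit; field types are assumed."""

    head_sha: str
    timestamp: datetime  # no default, so callers must always pass it
    events: List[str]  # placeholder type; the real event objects are not shown


# A commit with no workflow jobs still carries its push timestamp.
empty_commit = SignalCommitSketch(
    head_sha="abc123",  # made-up SHA for illustration
    timestamp=datetime(2025, 9, 29, 19, 0, tzinfo=timezone.utc),
    events=[],
)
print(empty_commit.timestamp.isoformat())
```

Because the field has no default, omitting `timestamp` fails at construction time with a `TypeError`, which is presumably why the existing test constructor calls had to be updated.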
    
    
    
### Testing

Before:

[2025-09-29T19-29-47.670686-00-00.html](https://github.com/user-attachments/files/22606856/2025-09-29T19-29-47.670686-00-00.html)


After:

[2025-09-29T21-38-10.190584-00-00.html](https://github.com/user-attachments/files/22606859/2025-09-29T21-38-10.190584-00-00.html)

v20250929-230908

29 Sep 23:10
44b32da
[autorevert] fix indentation in `fetch_tests_for_job_ids` (#7250)

Accidentally noticed another bug introduced by
https://github.com/pytorch/test-infra/pull/7241 while testing locally with
large lookback windows:

```
python -m pytorch_auto_revert --dry-run autorevert-checker periodic --hours 256 --bisection-limit 2   --hud-html
2025-09-29 15:56:16,356 INFO [root] [v2] Start: workflows=periodic hours=256 repo=pytorch/pytorch restart_action=log revert_action=log notify_issue_number=163650
2025-09-29 15:56:16,356 INFO [root] [v2] Run timestamp (CH log ts) = 2025-09-29T22:56:16.356213+00:00
2025-09-29 15:56:16,356 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching commits in time range: repo=pytorch/pytorch lookback=256h
2025-09-29 15:56:16,909 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Commits fetched: 419 commits in 0.55s
2025-09-29 15:56:16,909 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching jobs: repo=pytorch/pytorch workflows=periodic commits=419 lookback=256h
2025-09-29 15:56:56,850 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Jobs fetched: 2848 rows in 39.94s
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching tests for 1077 job_ids (453 failed jobs) in batches
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 1/2 (size=1024)
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] existing rows: 0
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 2/2 (size=53)
2025-09-29 15:56:56,859 INFO [pytorch_auto_revert.signal_extraction_datasource] existing rows: 0
2025-09-29 15:56:57,718 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Tests fetched: 265 rows for 1077 job_ids in 0.86s
```

Notice that no tests are read in the first batch!

After this fix:
```
python -m pytorch_auto_revert --dry-run autorevert-checker periodic --hours 256   --hud-html
2025-09-29 16:03:06,896 INFO [root] [v2] Start: workflows=periodic hours=256 repo=pytorch/pytorch restart_action=log revert_action=log notify_issue_number=163650
2025-09-29 16:03:06,896 INFO [root] [v2] Run timestamp (CH log ts) = 2025-09-29T23:03:06.896595+00:00
2025-09-29 16:03:06,897 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching jobs: repo=pytorch/pytorch workflows=periodic lookback=256h
2025-09-29 16:03:49,456 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Jobs fetched: 2887 rows in 42.56s
2025-09-29 16:03:49,466 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Fetching tests for 1113 job_ids (454 failed jobs) in batches
2025-09-29 16:03:49,466 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 1/2 (size=1024)
2025-09-29 16:03:51,753 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Test batch 2/2 (size=89)
2025-09-29 16:03:53,056 INFO [pytorch_auto_revert.signal_extraction_datasource] [extract] Tests fetched: 5002 rows for 1113 job_ids in 3.59s
2025-09-29 16:03:53,122 INFO [root] [v2] Extracted 144 signals
```
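
The actual diff is not reproduced in these notes. As a hypothetical illustration of how an indentation slip in a batched fetch can silently skip batches, compare the two helpers below. The function names and the `fetch_batch` callback are assumptions made for this sketch; only the batch size of 1024 is taken from the log output.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all_misindented(
    job_ids: Sequence[int],
    fetch_batch: Callable[[Sequence[int]], List[int]],
    batch_size: int = 1024,
) -> List[int]:
    rows: List[int] = []
    for batch in chunked(job_ids, batch_size):
        pass  # per-batch logging would happen here
    # Dedented by mistake: runs once, with only the last batch.
    # (With an empty job_ids list, `batch` would not even be defined.)
    rows.extend(fetch_batch(batch))
    return rows


def fetch_all_fixed(
    job_ids: Sequence[int],
    fetch_batch: Callable[[Sequence[int]], List[int]],
    batch_size: int = 1024,
) -> List[int]:
    rows: List[int] = []
    for batch in chunked(job_ids, batch_size):
        # Inside the loop: every batch is fetched and accumulated.
        rows.extend(fetch_batch(batch))
    return rows


job_ids = list(range(1077))  # same job count as the first log above
print(len(fetch_all_misindented(job_ids, list)))
print(len(fetch_all_fixed(job_ids, list)))
```

In this sketch the mis-indented version returns rows for the final 53 IDs only, while the fixed version covers all 1077.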

v20250929-192231

29 Sep 19:24
b6d478f
[autorevert] fix local cli (#7244)

Before:

```
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert % python -m pytorch_auto_revert hud
2025-09-29 12:12:37,159 WARNING [pytorch_auto_revert.clickhouse_client_helper] Connection test failed: HTTPDriver for https://hyt81izu0c.us-east-1.aws.clickhouse.cloud:8443 received ClickHouse error code 516
 Code: 516. DB::Exception: revert_lambda: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6151 (official build))

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/ivanzaitsev/test-infra/aws/lambda/pytorch-auto-revert/pytorch_auto_revert/__main__.py", line 336, in <module>
    main()
  File "/Users/ivanzaitsev/test-infra/aws/lambda/pytorch-auto-revert/pytorch_auto_revert/__main__.py", line 283, in main
    raise RuntimeError(
RuntimeError: ClickHouse connection test failed. Please check your configuration.
```

After:
```
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert % python -m pytorch_auto_revert hud
2025-09-29 12:18:23,118 INFO [root] [hud] Fetching run state ts=2025-09-29 19:13:18 repo=<any>
2025-09-29 12:18:23,521 INFO [root] [hud] Loaded state for repo=pytorch/pytorch workflows=Lint,trunk,pull,inductor
2025-09-29 12:18:23,521 INFO [root] [hud] Rendering HTML for repo=pytorch/pytorch workflows=Lint,trunk,pull,inductor lookback=16 → 2025-09-29_19-13-18.html
2025-09-29 12:18:23,523 INFO [root] HUD written to 2025-09-29_19-13-18.html
(venv) ivanzaitsev@ivanzaitsev-mbp pytorch-auto-revert %
```

v20250929-184550

29 Sep 18:47
60e16ae
[PYTORCHBOT] adds 'autorevert' classification for reverts (#7242)

Autorevert should issue revert commands in the format
`&pytorchbot revert -m "message" -c autorevert`.

This change enables pytorchbot to accept this classification.

v20250929-182904

29 Sep 18:30
73efcae
[autorevert] fix RetryWithBackoff, add tests (#7243)

A follow-up to https://github.com/pytorch/test-infra/pull/7241

Fixes the logic and adds unit tests.

v20250929-161641

29 Sep 16:18
aa5c240
[AUTOREVERT] use secret store over environment variables for password…

v20250929-155929

29 Sep 16:01
6ec1bf7
[AUTOREVERT] Add retry with back-off for GH API and CH (#7241)

This PR goes through the code, finds every place where we call an external
API, and adds a retry with exponential back-off.

Defaults: 5 retries, 0.5s base delay, and 10% jitter.

There are no changes to the existing logic: every relevant part of the code
is simply guarded with:

```
for attempt in RetryWithBackoff():
    with attempt:
        # the code 
```

The changes appear large due to:

* Extra indentation and the consequent linter changes
* The lazy nature of the gh and ch libraries, which resolve pagination as the
code consumes the information
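
The `for attempt in ...: with attempt:` pattern above can be sketched as an iterator that yields one context manager per try. This is not the PR's implementation: the class name `RetryWithBackoffSketch`, the helper `_Attempt`, the injectable `sleep` parameter, and the choice to retry on any `Exception` are assumptions. The sketch also reads "5 retries" as five attempts in total and applies the jitter as plus or minus 10% of each delay.

```python
import random
import time


class _Attempt:
    """Context manager for one try; suppresses the error unless it is the last."""

    def __init__(self, owner, is_last):
        self._owner = owner
        self._is_last = is_last

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._owner.succeeded = True  # tells the iterator to stop
            return False
        # Re-raise on the final attempt, and never swallow KeyboardInterrupt etc.
        if self._is_last or not issubclass(exc_type, Exception):
            return False
        return True  # suppress, so the for loop moves on to the next attempt


class RetryWithBackoffSketch:
    def __init__(self, max_attempts=5, base_delay=0.5, jitter=0.1, sleep=time.sleep):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.jitter = jitter
        self.succeeded = False
        self._sleep = sleep  # injectable so tests do not really wait

    def __iter__(self):
        self.succeeded = False
        for i in range(self.max_attempts):
            if i > 0:
                # Exponential back-off: base, 2*base, 4*base, ... with jitter.
                delay = self.base_delay * 2 ** (i - 1)
                delay *= 1 + random.uniform(-self.jitter, self.jitter)
                self._sleep(delay)
            yield _Attempt(self, is_last=(i == self.max_attempts - 1))
            if self.succeeded:
                return


# Usage: a call that fails twice, then succeeds on the third attempt.
calls = []
waits = []
for attempt in RetryWithBackoffSketch(sleep=waits.append):
    with attempt:
        calls.append("try")
        if len(calls) < 3:
            raise ConnectionError("transient failure")
print(len(calls), len(waits))
```

Yielding a context manager keeps the guarded code inline, which fits the description of wrapping existing calls without otherwise changing them.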

v20250929-124114

29 Sep 12:42
06985bf
[autorevert] fix handling for insufficient successes (#7235)

Previously, the code tried to group the branches for restarts resulting
from "infra check" and from "insufficient events". This was a mistake
that resulted in delayed restarts.

Specifically, in this situation:
<img width="999" height="747" alt="image"
src="https://github.com/user-attachments/assets/9cd0051e-8d87-4fe2-af90-88a776847c4d"
/>
a restart on the success side is expected, but the system waits for
the pending job on the failure side.


This PR decouples and simplifies the logic. Now, all restarts are
scheduled independently (relying on set deduplication) and all final
checks are performed afterwards.

Added a unit test to specifically verify the case above.
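
The idea of scheduling restarts independently and relying on set deduplication can be sketched as follows. This is an illustration only: the function name `schedule_restarts`, the `(workflow, sha)` key shape, and the sample values are assumptions, not the code from the PR.

```python
from typing import Iterable, Set, Tuple

RestartKey = Tuple[str, str]  # (workflow name, commit sha); assumed shape


def schedule_restarts(*sources: Iterable[RestartKey]) -> Set[RestartKey]:
    """Union restart requests from independent checks; the set drops duplicates."""
    scheduled: Set[RestartKey] = set()
    for source in sources:
        scheduled.update(source)
    return scheduled


# Each check proposes restarts on its own, without waiting for the other.
infra_check = [("trunk", "sha_ok"), ("trunk", "sha_bad")]
insufficient_events = [("trunk", "sha_ok")]  # also requested by the other check
restarts = schedule_restarts(infra_check, insufficient_events)
print(sorted(restarts))
```

Since duplicates collapse in the set, neither check needs to know what the other has already requested, and any final checks can run once over the combined result.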